Preamble

This is an IPython Notebook to accompany the paper entitled 'The Architecture of Genome Wide Association Studies' by Melinda Mills and Charles Rahal. It allows us to share the data analysis undertaken in the construction of this data-driven review of all GWAS to date, and it can be used to dynamically replicate the analysis going forward. For installation guidelines, please see the readme.md which accompanies this repository. A preliminary point to note is that if you wish to run this .ipynb file itself, you may need to override the default settings of some versions of Jupyter Notebook (4.2 to 5.1) by opening it with:

jupyter notebook --NotebookApp.iopub_data_rate_limit=10000000000

or by editing your jupyter_notebook_config.py file. Let's begin by loading our favourite Python modules, some IPython magic, and a bunch of custom functions we've written specifically for this project:

In [1]:
import networkx as nx
import os
import pandas as pd
import re
import seaborn as sns
import sys
import numpy as np
import csv
import itertools
import warnings
import gender_guesser.detector as gender
from Bio import Entrez
from IPython.display import HTML, display
from Support.LoadData import (
    EFO_parent_mapper,
    load_gwas_cat,
    load_pubmed_data, 
    make_timely,
    make_clean_CoR,
    download_cat
)
from Support.PubMed import (
    build_collective,
    build_author,
    build_funder,
    build_abstract,
    build_citation
)
from Support.Analysis import (
    simple_facts,
    ancestry_parser,
    make_meanfemale_andranks,
    make_funders,
    mapped_trait_summary
)
from Support.Figures import (
    gwas_growth,
    choropleth_map,
    wordcloud_figure,
    plot_heatmap,
    plot_bubbles,
    boxswarm_plot
)
from Support.Additional import clean_names
from Support.Ancestry import ancestry_cleaner

warnings.filterwarnings('ignore')
%config InlineBackend.figure_format = 'png'
%matplotlib inline
%load_ext autoreload
%autoreload 2

Then, let's dynamically grab the three main curated datasets that we need ('All Associations', 'All Studies', 'All Ancestries') from the GWAS Catalogue on the EBI website, along with the EFO-to-parent-trait mapping file from their FTP site:

In [2]:
output_path = os.path.abspath(os.path.join('__file__',
                                    '../..',
                                    'data',
                                    'Catalogue',
                                    'Raw'))
ebi_download = 'https://www.ebi.ac.uk/gwas/api/search/downloads/'
download_cat(output_path, ebi_download)

Let's link the PUBMEDID variable to the PubMed API and build a series of datasets from it using the support functions written in PubMed.py. Note: 'collective' mostly corresponds to consortia and study groups.

In [3]:
Cat_Studies, Cat_Ancestry, Cat_Ancestry_groupedbyN, Cat_Full = load_gwas_cat()
id_list = set(Cat_Studies['PUBMEDID'].astype(str).tolist())
Entrez.email = 'your@email.com'
papers = Entrez.read(Entrez.efetch(db='pubmed',retmode='xml', id=','.join(id_list)))
build_collective(papers)
build_author(papers)
build_funder(papers)
build_abstract(papers)
build_citation(id_list,Entrez.email)
Number of Collectives Found: 1573!
Authors with last names but no forenames: 8 out of 122405
Built a database of Authors from list of PUBMEDID IDs!
Built a database of Funders from list of PUBMEDID IDs!
Built a database of Abstracts from list of PUBMEDID IDs!
Processing for Citations: : 100%|████████████████████████████████████████████████| 3639/3639 [2:30:17<00:00,  2.48s/it]
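As an aside, a single efetch call joining thousands of PUBMEDIDs can run up against NCBI's request limits. A cautious batching pattern would look something like the sketch below (our own illustration; the helper `chunked` is not part of the repository's code):

```python
# Sketch of a batching helper for large Entrez ID lists (not part of the
# repository's code): yield comma-joined batches of at most `size` IDs,
# so each efetch call stays small.
def chunked(ids, size=200):
    """Yield successive comma-joined batches of at most `size` IDs."""
    ids = list(ids)
    for start in range(0, len(ids), size):
        yield ','.join(ids[start:start + size])

# Usage against the Entrez API (network call left commented out):
# from Bio import Entrez
# Entrez.email = 'your@email.com'
# for batch in chunked(id_list):
#     papers = Entrez.read(Entrez.efetch(db='pubmed', retmode='xml', id=batch))
```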

Introduction

Some simple 'facts'

Let's do some basic descriptive analysis to see what we've got here:

In [4]:
Cat_Studies, Cat_Ancestry, Cat_Ancestry_groupedbyN, Cat_Full = load_gwas_cat()
FunderInfo, AbstractInfo, AuthorMaster = load_pubmed_data()
simple_facts(Cat_Studies, Cat_Ancestry,
             Cat_Ancestry_groupedbyN, Cat_Full,
             AuthorMaster, mostrecentgwascatdate='2018-10-29')
There are currently: 3639 GWAS papers published (in terms of unique PUBMEDIDs)
The first observed GWAS in the catalogue was dated: 2005-03-10
However, only: 10 papers were published in 2005 and 2006 combined
There are currently: 5849 unique Study Accessions
There are currently: 3508 unique Diseases/Traits studied
These correspond to: 2532 unique EBI "Mapped Traits"
The total number of Associations found is currently: 89588
The average number of Associations is currently: 15.3
Mean P-Value for the strongest SNP risk allele is currently: 1.3729e-06
The number of associations reaching the 5e-8 threshold is 49451
The journal to feature the most GWAS studies since 2005-03-10 is: Nat Genet
However, in 2017, Nat Commun published the largest number of GWAS papers
So far, in 2018, Nat Genet published the largest number of GWAS papers
Largest Accession to date: 1030836.0 people.
This was published in Nat Genet.
The first author was: Nielsen JB.
Total number of SNP-trait associations is 89588.
Total number of journals publishing GWAS is 474.
The study with the largest number of authors has: 559 authors.
The current publication lag in the Catalog is: 47 days
The lag between the date it was most recently published and date of last included GWAS: 46 days

Can the abstracts give us any clues?

Let's make a nice infographic and do some basic NLP on the abstracts from all PubMed IDs. This figure is the header on the readme.md:

In [5]:
abstract_count = os.path.abspath(
                 os.path.join('__file__', '../..', 'data', 'PUBMED',
                              'Pubmed_AbstractCount.csv'))
output_file = os.path.abspath(
              os.path.join('__file__', '../../', 'figures', 'pdf',
                           'helix_wordcloud_1250_5000_black.pdf'))
wordcloud_figure(abstract_count, output_file)

The file Pubmed_AbstractCount.csv (in the PUBMED subdirectory of data) details a breakdown of the abstract word counts. Check the counts for 'European', 'Asian', etc.
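The counting behind a file like Pubmed_AbstractCount can be sketched with a simple term-frequency tally (the toy abstracts below are invented for illustration, not taken from the real data):

```python
import re
from collections import Counter

# Toy sketch of an abstract word count: lowercase, tokenise on runs of
# letters/hyphens, and tally term frequencies across all abstracts.
abstracts = ['Genome-wide association study of height in European samples.',
             'A GWAS of height in East Asian and European populations.']
counts = Counter(word for abstract in abstracts
                 for word in re.findall(r'[a-z\-]+', abstract.lower()))
# counts['european'] is 2 here; the real pipeline works the same way
# over every abstract returned by PubMed.
```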

The growth in depth and scope of GWAS over time

We can also visualise the ever-increasing sample sizes and the growing popularity of GWAS. The top panel shows the increasing number of GWAS conducted over time. We also see increasingly large sample sizes: i.e. the fraction of 'Big N' studies increases. The bottom-left subplot shows that this growth brings larger Ns, which correlate highly with the number of significant associations found. The right subplot shows the unique number of journals publishing GWAS per year and the unique number of diseases or traits studied; both steadily increase as the appeal of GWAS broadens.

In [6]:
output_file = os.path.abspath(
              os.path.join('__file__', '../../', 'figures', 'svg',
                           'Figure_1.svg'))
Cat_Studies, Cat_Ancestry, Cat_Ancestry_groupedbyN, Cat_Full = load_gwas_cat()
gwas_growth(output_file, Cat_Studies, Cat_Ancestry, Cat_Ancestry_groupedbyN)

Participants

Ancestry

This section of the analysis uses only data contained within the catalogue, but goes slightly deeper than Section 2 above and other related papers in the literature. We use the 'Broad Ancestral Category' field and aggregate from 135 combinations of 17 ancestries into 7 'broader ancestral categories' to calculate a measure comparable to Popejoy and Fullerton. To get a more detailed analysis of polyvocality, we load in some of our supporting functions to clean up the "INITIAL SAMPLE SIZE" and "REPLICATION SAMPLE SIZE" free-text fields in the catalogue and then calculate something analogous to Panofsky and Bliss. The supporting functions are based on regular expressions and data-wrangling techniques which exploit patterns in these free-text fields. By way of a very simple example: “19,546 British ancestry individuals from 6863 families.” gets cleaned to two separate fields, “19,546” and “British”, which can then be used for further downstream analysis. A slightly more complex example: “2,798 European ancestry individuals, 228 French Canadian founder population individuals” corresponds to two entries of 2798 and 228 in the new ‘Cleaned N’ type variable, and to ‘European’ and ‘French Canadian’ in the ‘Cleaned Ancestry’ type variable respectively. Uncomment the appropriate lines of text to get the full lists.
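The essence of that cleaning step can be sketched with a toy regular expression (a simplification written for illustration only; the repository's ancestry_cleaner handles many more patterns than this):

```python
import re

# Toy version of the free-text sample-size cleaner: capture
# "<number> <Ancestry> ... individuals" pairs, tolerating the optional
# "ancestry" and "founder population" filler words seen in the catalogue.
PAIR = re.compile(r'([\d,]+)\s+([A-Z][A-Za-z/ ]*?)\s+'
                  r'(?:ancestry\s+)?(?:founder population\s+)?individuals')

def parse_sample_field(text):
    """Return (n, ancestry) tuples parsed from a free-text field."""
    return [(int(num.replace(',', '')), anc.strip())
            for num, anc in PAIR.findall(text)]

# parse_sample_field('19,546 British ancestry individuals from 6863 families.')
# gives [(19546, 'British')], mirroring the example above.
```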

In [7]:
Cat_Studies, Cat_Ancestry, Cat_Ancestry_groupedbyN, Cat_Full = load_gwas_cat()
Cat_Studies['InitialClean'] = Cat_Studies.apply(
    lambda row: ancestry_cleaner(row, 'INITIAL SAMPLE SIZE'), axis=1)
output_path = os.path.abspath(
              os.path.join('__file__',
                           '../..',
                           'data',
                           'Catalogue',
                           'Synthetic',
                           'new_initial_sample.csv'))
ancestry_parser(output_path, 'InitialClean', Cat_Studies)
Cat_Studies['ReplicationClean'] = Cat_Studies.apply(
    lambda row: ancestry_cleaner(row, 'REPLICATION SAMPLE SIZE'), axis=1)
output_path = os.path.abspath(
              os.path.join('__file__',
                           '../..',
                           'data',
                           'Catalogue',
                           'Synthetic',
                           'new_replication_sample.csv'))
ancestry_parser(output_path, 'ReplicationClean', Cat_Studies)
In [8]:
clean_initial = pd.read_csv(os.path.abspath(
                            os.path.join('__file__', '../..', 'data',
                                         'Catalogue', 'Synthetic',
                                         'new_initial_sample.csv')),
                            encoding='utf-8')
clean_initial_sum = pd.DataFrame(
    clean_initial.groupby(['Cleaned Ancestry']).sum())
clean_initial_sum.rename(
    columns={'Cleaned Ancestry Size': 'Initial Ancestry Sum', }, inplace=True)
clean_initial_count = clean_initial.groupby(['Cleaned Ancestry']).count()
clean_initial_count.rename(
    columns={'Cleaned Ancestry Size': 'Initial Ancestry Count', }, inplace=True)
clean_initial_merged = clean_initial_sum.merge(pd.DataFrame(
    clean_initial_count['Initial Ancestry Count']), how='outer', left_index=True,
    right_index=True)
clean_initial_merged = clean_initial_merged.sort_values(
    by='Initial Ancestry Sum', ascending=False)
holder = ''
for index, row in clean_initial_merged.iterrows():
    holder = holder + \
        str(index) + ' (' + str(row['Initial Ancestry Sum']) + \
        ',' + str(row['Initial Ancestry Count']) + '), '
print('There are a total of ' + str(len(clean_initial_merged)) +
      ' ancestries found in the \'INITIAL SAMPLE SIZE\' column.' +
      '\nThese are: (number of people used in all studies, ' +
      'number of studies included):\n\n' + holder[:-2] + '.\n\n')
There are a total of 212 ancestries found in the 'INITIAL SAMPLE SIZE' column.
These are: (number of people used in all studies, number of studies included):

European (91352174,5890), Japanese (6520029,605), British (4617429,58), Icelandic (2290333,20), East Asian (1633062,208), African American (1631302,720), Finnish (927347,165), African (670551,161), Korean (570712,246), Han Chinese (542531,392), Hispanic (538219,252), European or East Asian (529946,4), South Asian (497157,53), Asian (306532,145), Hispanic/Latino (290293,63), Chinese (280859,169), Indian Asian (181580,44), Latin American (162910,42), African American/Afro Caribbean (161018,36), Indian (132082,54), African American or Afro Caribbean (124245,12), Hispanic/Latino American (116226,12), Hispanic or Latino (83557,4), European and Asian (79845,2), Dutch (75049,20), Taiwanese Han Chinese (72100,14), Swedish (71116,24), Filipino (68299,45), Sardinian (66472,24), European American (63371,37), Northern European (59894,17), Hispanic or Latin American (55030,16), Latino (54928,41), Pomak (53847,31), Rucphen (51402,28), Bangladeshi (49466,36), Hispanic/Latin American (48109,7), French (47074,63), Malay (46437,26), Mylopotamos (45756,31), Mexican (42743,25), German (42145,16), Pakistani (38610,11), European and East Asian (35524,2), European and South Asian (30482,2), Orcadian (30241,27), Mexican American (29291,42), Asian Indian (26798,9), Black (24820,29), Ashkenazi Jewish (23674,34), European and Hispanic (22604,1), European and African American (22593,2), Old Order Amish (21338,40), Italian (20264,12), Val Borbera (19754,15), European and Indian Asian (17967,1), Saudi Arab (17030,8), Singaporean Chinese (14920,24), South East Asian (14708,12), Northern Finnish (14289,3), Hispanic American (13835,16), Northwestern European (13483,2), British and Irish (11375,2), Western European (11043,3), African British (10206,4), Sub Saharan African (9377,25), Brazilian (8761,19), Korkulan (8512,12), African American and African (8298,2), Danish (8247,4), Mexican and Latin American (8214,2), Thai (8143,18), Turkish (7638,14), French Canadian (7508,13), Austrian (7407,9), 
Ugandan (7110,8), Puerto Rican (7096,9), Sorbian (6982,8), Mainland Hispanic/Latino (6734,1), Fruili Venezia Giulia (6634,7), European and  Rucphen (6597,1), Vietnamese (6469,4), Polish (6413,13), Lebanese (6308,6), Norwegian (5704,6), Caribbean Hispanic (5607,4), Caribbean Hispanic/Latino (5458,1), Kuwaiti (5412,4), Yoruban (5115,7), Gambian (5038,4), Amish (5035,10), Latino American (4912,4), South Tyrol (4728,5), Scottish (4666,2), Latino or African (4658,9), Punjabi Sikh (4619,5), Southern Chinese (4582,3), Indo European (4238,4), Kosraen (4168,2), Taiwanese (3946,10), Hutterite (3776,9), American Indian (3704,12), Carlantino (3667,10), Nigerian (3564,3), West African (3434,5), South Indian (3265,4), Pima Indian (3251,8), Afro Caribbean (3201,4), Samoan (3072,1), African American or European (3054,2), Tibetan (3043,2), Tanzanian (3041,5), African American/African Caribbean (3015,4), Kenyan (2857,2), Native American North American (2835,2), North Indian (2692,4), Sereer (2640,5), Micronesian (2346,1), Celtic (2307,1), Oceania (2229,2), Cilento (2163,3), Indonesian (2041,5), Sirente (2002,7), Middle Eastern (1950,2), Black Hispanic (1940,4), Native Hawaiian (1902,6), Moroccan (1851,9), Japanese American (1697,3), African American and African Caribbean (1612,1), Surinamese (1553,7), Talana (1488,3), African and Asian (1434,8), African and African Arab (1215,4), Spanish (1193,4), Sri Lankan Sinhalese (1154,4), Slavic (1102,1), Southern European (1090,1), Latin/South American (1050,2), Split (986,2), Croatian (959,1), African Caribbean (929,2), Ogliastran (897,1), Ethiopian (895,6), Native American South American (875,2), African Arab (853,4), Arab (836,6), Mongolian (756,1), Martu Australian Aboriginal (752,3), Central European (744,1), South African (733,2), Uyghur (721,1), Tyrolean (696,1), Mixed (684,1), South African Black (637,2), Thai Chinese (618,2), Papua New Guinean (576,2), Bulgarian (564,2), European American and Mexican American (540,1), Vietnamese 
Korean (518,1), Costa Rican (506,2), Jewish (497,2), Isreali Arab (496,6), Middle Eastern/North African (482,4), Taiwanese Han (470,2), Irish (432,2), Malaysian Chinese (371,2), Russian (366,6), Iranian (352,2), Silk Road (348,1), Finnish Saami (347,1), Maghrebian (346,2), Jewish Isreali (331,2), North African (329,7), Plains American Indian (322,1), Fijian Indian (319,2), Hong Kong Chinese (315,1), Norfolk Island (285,2), Greater Middle Eastern (281,2), Asian American (261,8), South African Afrikaner (251,2), Hispanic/Native American (237,1), Tatar (231,2), Southern Indian (229,2), Malawian (226,2), Tunisian (221,2), Portuguese (221,2), Asian or Pacific Islander (207,3), Arab Isreali (189,1), Hispanic and European (180,2), Hispanic/Native (164,1), Bashkir (161,2), Han Chinese American (156,2), Hispanic Caucasian (136,2), Nepalese (99,2), Oriental (90,1), Native American (89,7), Solomon Islander (85,2), Uygur Chinese (83,1), Dai Chinese (58,1), European and Mexican (41,2), Jingpo Chinese (40,1), Romanian (32,1), European/Asian (27,2), Caucasian Eastern Mediterranean (26,2), Native Hawaiian or Pacific (14,2), Hui Chinese (11,1), Cape Verdian (4,2), Native American or Alaska Native (4,2), Isreali (2,2), Curacao (2,2), Dominican Republic (2,2), Afghanistan (2,2).


In [9]:
clean_replication = pd.read_csv(os.path.abspath(
                                os.path.join('__file__',
                                             '../..',
                                             'data',
                                             'Catalogue',
                                             'Synthetic',
                                             'new_replication_sample.csv')),
                                encoding='utf-8')
clean_replication_sum = pd.DataFrame(
    clean_replication.groupby(['Cleaned Ancestry']).sum())
clean_replication_sum.rename(
    columns={'Cleaned Ancestry Size': 'Replication Ancestry Sum', }, inplace=True)
clean_replication_count = clean_replication.groupby(
    ['Cleaned Ancestry']).count()
clean_replication_count.rename(
    columns={'Cleaned Ancestry Size': 'Replication Ancestry Count', }, inplace=True)
clean_replication_merged = clean_replication_sum.merge(
    pd.DataFrame(clean_replication_count['Replication Ancestry Count']),
    how='outer', left_index=True, right_index=True)
clean_replication_merged = clean_replication_merged.sort_values(
    by='Replication Ancestry Sum', ascending=False)
holder = ''
for index, row in clean_replication_merged.iterrows():
    holder = holder + \
        str(index) + ' (' + str(row['Replication Ancestry Sum']) + \
        ',' + str(row['Replication Ancestry Count']) + '), '
print('There are a total of ' + str(len(clean_replication_merged)) +
      ' ancestries found in the \'REPLICATION SAMPLE SIZE\' column.' +
      ' These are (number of people used in all studies, number of' +
      ' studies included):\n\n' + holder[:-2] + '.')
There are a total of 150 ancestries found in the 'REPLICATION SAMPLE SIZE' column. These are (number of people used in all studies, number of studies included):

European (37664311,3053), Asian (3528736,112), East Asian (2131174,240), Japanese (1507738,300), Icelandic (1181163,17), Han Chinese (1026731,258), African American (758634,354), Chinese (427391,185), Hispanic (410501,148), South Asian (402331,57), Korean (311243,148), British (251083,3), African (192176,76), European and Asian (136010,2), Indian Asian (118255,14), European and African American (99227,9), European and East Asian (87155,5), Finnish (83580,24), African American/Afro Caribbean (78188,42), Hispanic/Latino (62435,12), Indian (50713,47), Hispanic/Latino American (43200,6), European and Middle East/North African (29807,2), African America/Afro Caribbean (27661,1), Mexican (27148,17), Sardinian (25831,13), Latino (25299,17), Sub Saharan African (23752,26), European and  Rucphen (22789,1), Malay (22036,27), Brazilian (21631,17), African American and African (20809,1), Dutch (20367,3), Black and European (19090,1), Mexican American (18703,24), Northern European (16962,3), Filipino (16596,17), Ashkenazi Jewish (15801,16), European and Indian Asian (13615,1), African British (13364,3), Asian Indian (12906,7), Indo European (12659,5), African American and Afro Caribbean (11583,3), Southern Chinese (11058,3), Old Order Amish (10887,12), Punjabi Sikh (10261,5), Thai (10032,25), African American or Afro Caribbean (8972,3), Turkish (8855,11), Hispanic/Latin American (8622,5), European American (8614,3), Rucphen (8337,7), Mexican/Latino (8214,2), Saudi Arab (7692,12), Pima Indian (7495,9), African Caribbean (7480,2), American Indian (7039,7), French Canadian (6846,15), Swedish (6366,4), Pakistani (6115,6), German (5569,4), Vietnamese (5490,4), Indonesian (5253,7), Silk Road (5017,11), West African (4780,7), Kuwaiti (4704,4), Southern Indian (4600,2), Middle Eastern (4591,5), Danish (4258,2), Black (4111,12), Singaporean (3968,2), North Indian (3956,4), Russian (3950,2), Singaporean Chinese (3580,2), Gambian (3463,2), Dravidian (3260,4), She Chinese (3235,1), Iranian 
(3210,9), Northwestern European (2942,2), Hispanic / Latino (2826,1), Talana (2508,3), Amish (2429,4), African American and African Caribbean (2147,1), Arab (2120,13), Samoan (2102,1), European and Ashkenazi Jewish (2035,2), African American and European (1893,1), Seychellois (1875,8), Polish (1810,9), Costa Rican (1778,3), South and Central American (1769,1), Cilento (1709,2), Hispanic and European (1682,2), North African (1566,10), Polynesian (1536,2), Singaporean Indian (1520,1), Nepalese (1478,6), Caribbean Hispanic (1390,6), Latin American (1386,6), Oceania (1372,5), Southern European (1296,1), Native Hawaiian (1277,2), Hutterite (1255,2), Moroccan (1247,3), Latino American (1242,2), Indigenous Mexican (1178,2), European and Hispanic (1174,4), Asian American (1094,4), Jamaican (1089,2), Orcadian (1035,2), Afro Caribbean (962,1), Fruili Venezia Giulia (957,2), Val Borbera (910,1), Uygur Kazakh Chinese (840,2), Spanish (662,2), Sorbian (659,1), Mongolian (542,2), Portuguese (525,2), South African Black (481,2), Mayan (475,2), Bulgarian (450,2), Ethiopian (433,6), Malaysian Chinese (420,2), Taiwanese (400,1), Middle Eastern Arab (352,2), Jewish (344,2), Carlantino (322,1), Afro Caribbean and Sub Saharan African (321,1), Surinamese (284,1), Hispanic and Native American (282,1), French (276,1), Western European (253,1), Malaysian (212,2), Arab Isreali (198,2), Tibetan (161,1), Yoruban (146,2), Uyghur (99,2), Ashkenazi Jewish Isreali (96,2), Hispanic American (85,1), Jewish Isreali (74,2), European/Asian (74,2), African America and European (68,1), Italian (45,2), African and Asian (38,2), Aboriginal Canadian (15,2), Native Hawaiian or Pacific (7,1), Black African (5,1), Native American or Alaska Native (4,1), South African (2,1), Colored African (1,1).

Let's aggregate to create a dataframe for native/indigenous populations using a dictionary-based classifier:
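The dictionary-based classification in the next cell follows a standard pattern: an index-aligned lookup table maps each cleaned ancestry term to a curated label, and numpy's setdiff1d flags terms the dictionary does not yet cover. A minimal sketch with invented data:

```python
import numpy as np
import pandas as pd

# Invented data for illustration only; the real lookup lives in
# data/Support/native_classifier_dictionary.csv.
terms = pd.DataFrame({'Total Ancestry Count': [8, 7000, 1]},
                     index=['Pima Indian', 'European', 'Samoan'])
lookup = pd.DataFrame({'Native Population': ['Pima Indians', 'Not Native']},
                      index=['Pima Indian', 'European'])
# Terms absent from the lookup need manual classification.
unmapped = np.setdiff1d(terms.index.values, lookup.index.values)
# An inner merge keeps only the terms the dictionary already covers.
classified = terms.merge(lookup, left_index=True, right_index=True)
```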

In [10]:
native_dict = pd.read_csv(os.path.abspath(os.path.join('__file__', '../..',
                                         'data', 'Support',
                                         'native_classifier_dictionary.csv')),
                            index_col='Cleaned Ancestry')
native_df = pd.merge(clean_replication_merged, clean_initial_merged,
                   left_index=True,right_index=True, how='outer')
if len(np.setdiff1d(native_df.index.values, native_dict.index.values))>0:
    print('There are still ancestries which need classifying as unique!'+
         ' Add them to the dictionary: ' +
         ', '.join(np.setdiff1d(native_df.index.values,native_dict.index.values)))
else:
    print('Congratulations! The Native/Indigenous dictionary is fully up to date!')
native_df = native_df.fillna(0)
native_df['Total Ancestry Sum'] = native_df['Replication Ancestry Sum'] + \
                                  native_df['Initial Ancestry Sum']
native_df['Total Ancestry Count'] = native_df['Replication Ancestry Count'] + \
                                    native_df['Initial Ancestry Count']
native_df = pd.merge(native_df,native_dict,left_index=True, right_index=True)
print('We observe ' + str(len(native_df)) + ' unique terms for ancestry/ethnicity in total.')
print('Of these, ' + str(len(native_df[native_df['Native Population']!='Not Native'])) + 
      ' relate to native or Indigenous populations ancestry/ethnicity.')
Congratulations! The Native/Indigenous dictionary is fully up to date!
We observe 240 unique terms for ancestry/ethnicity in total.
Of these, 19 relate to native or Indigenous populations ancestry/ethnicity.
In [11]:
print('These relate to ' +
      str(len(native_df[native_df['NativeType']=='Indicated']['Native Population'].unique())) +
      ' specific native/indigenous groups as implied by the free text.')
print('These are: ' +
      ', '.join(native_df[native_df['NativeType']=='Indicated']['Native Population'].unique()))
print('Out of ' + str(int(native_df['Total Ancestry Count'].sum())) +
      ' mentions of ancestry/ethnicity, ' +
      str(int(native_df[native_df['NativeType']=='Indicated']['Total Ancestry Count'].sum())) +
      ' relate to native/indigenous groups as implied by the free text (' +
      str(round(native_df[native_df['NativeType']=='Indicated']['Total Ancestry Count'].sum() /
                native_df['Total Ancestry Count'].sum()*100, 3)) + '%).')
print('Out of ' + str(int(native_df['Total Ancestry Sum'].sum())) +
      ' participants, ' +
      str(int(native_df[native_df['NativeType']=='Indicated']['Total Ancestry Sum'].sum())) +
      ' are of native/indigenous groups as implied by the free text (' +
      str(round(native_df[native_df['NativeType']=='Indicated']['Total Ancestry Sum'].sum() /
                native_df['Total Ancestry Sum'].sum()*100, 3)) + '%).')
print('The most common native/indigenous group implied by the free text is: ' +
      native_df[native_df['NativeType']=='Indicated']['Total Ancestry Sum'].sort_values(ascending=False).index.values[0] +
      ', which contributed ' +
      str(int(native_df[native_df['NativeType']=='Indicated']['Total Ancestry Sum'].sort_values(ascending=False).iloc[0])) +
      ' people in terms of sample.')
These relate to 8 specific native/indigenous groups as implied by the free text.
These are: Aboriginal Canadians, Hispanic/Native American, Indigenous Mexicans, Indigenous Australian, Native Americans, Indigenous peoples of South America, Native Americans or Alaska Natives, Native Hawaiians
Out of 16449 mentions of ancestry/ethnicity, 34 relate to native/indigenous groups as implied by the free text (0.207%).
Out of 167331039 participants, 9353 are of native/indigenous groups as implied by the free text (0.006%).
The most common native/indigenous group implied by the free text is: Native Hawaiian, which contributed 3179 people in terms of sample.
In [12]:
print('These relate to ' +
      str(len(native_df[native_df['NativeType']!='Not Native']['Native Population'].unique())) +
      ' specific native/indigenous groups implied by the free text or manually classified.')
print('These are: ' +
      ', '.join(native_df[native_df['NativeType']!='Not Native']['Native Population'].unique()))
print('Out of ' + str(int(native_df['Total Ancestry Count'].sum())) +
      ' mentions of ancestry/ethnicity, ' +
      str(int(native_df[native_df['NativeType']!='Not Native']['Total Ancestry Count'].sum())) +
      ' relate to native/indigenous populations implied by the free text or manually classified (' +
      str(round(native_df[native_df['NativeType']!='Not Native']['Total Ancestry Count'].sum() /
                native_df['Total Ancestry Count'].sum()*100, 3)) + '%).')
print('Out of ' + str(int(native_df['Total Ancestry Sum'].sum())) +
      ' participants, ' +
      str(int(native_df[native_df['NativeType']!='Not Native']['Total Ancestry Sum'].sum())) +
      ' are of native/indigenous populations implied by the free text or manually classified (' +
      str(round(native_df[native_df['NativeType']!='Not Native']['Total Ancestry Sum'].sum() /
                native_df['Total Ancestry Sum'].sum()*100, 3)) + '%).')
print('The most common native/indigenous group implied by the free text or manually classified is: ' +
      native_df[native_df['NativeType']!='Not Native']['Total Ancestry Sum'].sort_values(ascending=False).index.values[0] +
      ', which contributed ' +
      str(int(native_df[native_df['NativeType']!='Not Native']['Total Ancestry Sum'].sort_values(ascending=False).iloc[0])) +
      ' people in terms of sample.')
These relate to 16 specific native/indigenous groups implied by the free text or manually classified.
These are: Aboriginal Canadians, Native American, Pacific Islander, Bashkirs, Saami, Hispanic/Native American, Indigenous Mexicans, Indigenous Australian, Mayan, Native Americans, Indigenous peoples of South America, Native Americans or Alaska Natives, Native Hawaiians, Pima Indians, Plains Indians, Samoan
Out of 16449 mentions of ancestry/ethnicity, 81 relate to native/indigenous populations implied by the free text or manually classified (0.492%).
Out of 167331039 participants, 37528 are of native/indigenous populations implied by the free text or manually classified (0.022%).
The most common native/indigenous group implied by the free text or manually classified is: Pima Indian, which contributed 10746 people in terms of sample.

Let's now do a quick check that our 'Broader' dictionary is up to date:

In [13]:
if len(Cat_Ancestry[Cat_Ancestry['Broader'].isnull()]) > 0:
    print('Perhaps we need to update the dictionary for new terms? Something like these ones:\n' +
         '\n'.join(Cat_Ancestry[Cat_Ancestry['Broader'].isnull()]['BROAD ANCESTRAL'].unique()))
else:
    print('Nice! Looks like our Broad to Broader dictionary is up to date!')
Nice! Looks like our Broad to Broader dictionary is up to date!

Let's put these numbers into tables to get a deeper understanding, first including rows in which at least one ancestry is NR (Not Recorded), and then dropping all rows in which at least one recorded ancestry is NR:
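The two tables that follow both use the same groupby-then-share pattern: sum N by broader ancestry within a stage, then express each group's N as a percentage of the stage total. A minimal sketch with invented numbers:

```python
import pandas as pd

# Invented numbers for illustration of the share-table pattern.
anc = pd.DataFrame({'Broader': ['European', 'Asian', 'European'],
                    'STAGE': ['initial', 'initial', 'replication'],
                    'N': [100, 50, 30]})
# Sum N per broader ancestry in the initial (discovery) stage...
tbl = anc[anc.STAGE == 'initial'].groupby('Broader')['N'].sum().to_frame()
# ...then turn each row into a percentage of the stage total.
tbl['% Discovery'] = tbl['N'] / tbl['N'].sum() * 100
```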

In [14]:
Broad_Ancestral_Full_initial = pd.DataFrame(
    Cat_Ancestry[Cat_Ancestry.STAGE ==
                 'initial'].groupby(['Broader'])['N'].sum())
Broad_Ancestral_Full_initial.rename(
    columns={'N': 'N (Initial)'}, inplace=True)
Broad_Ancestral_Full_replication = pd.DataFrame(
    Cat_Ancestry[Cat_Ancestry.STAGE ==
                 'replication'].groupby(['Broader'])['N'].sum())
Broad_Ancestral_Full_replication.rename(
    columns={'N': 'N (Replication)'}, inplace=True)
Broad_Ancestral_Full = pd.merge(Broad_Ancestral_Full_initial,
                                Broad_Ancestral_Full_replication,
                                left_index=True, right_index=True)
Broad_Ancestral_Full['Total'] = Broad_Ancestral_Full['N (Initial)'] + \
    Broad_Ancestral_Full['N (Replication)']
Broad_Ancestral_Full['% Discovery'] = (
    Broad_Ancestral_Full['N (Initial)'] /
    Broad_Ancestral_Full['N (Initial)'].sum()) * 100
Broad_Ancestral_Full['% Replication'] = (
    Broad_Ancestral_Full['N (Replication)'] /
    Broad_Ancestral_Full['N (Replication)'].sum()) * 100
Broad_Ancestral_Full['% Total'] = (Broad_Ancestral_Full['Total'] /
                                   Broad_Ancestral_Full['Total'].sum()) * 100
Broad_Ancestral_Full.to_csv(os.path.abspath(
                            os.path.join('__file__',
                                         '../..',
                                         'tables',
                                         'Broad_Ancestral_Full.csv')))
Broad_Ancestral_Full
Out[14]:
N (Initial) N (Replication) Total % Discovery % Replication % Total
Broader
African 364922 143454 508376 0.292368 0.255475 0.280921
African Am./Caribbean 2289950 1002735 3292685 1.834664 1.785757 1.819489
Asian 11607308 9183808 20791116 9.299552 16.355320 11.488862
European 100678946 39199644 139878590 80.662032 69.810118 77.294829
Hispanic/Latin American 1520762 678596 2199358 1.218405 1.208502 1.215333
In Part Not Recorded 7789855 5040079 12829934 6.241082 8.975809 7.089631
Other/Mixed 564039 903493 1467532 0.451897 1.609019 0.810936

Drop the 'in part not recorded' rows (i.e. where the Broad Ancestral Category contains at least one NR):

In [15]:
Broad_Ancestral_NoNR = Broad_Ancestral_Full[['N (Initial)',
                                             'N (Replication)',
                                             'Total']]
Broad_Ancestral_NoNR = Broad_Ancestral_NoNR.drop('In Part Not Recorded')
Broad_Ancestral_NoNR['Total'] = Broad_Ancestral_NoNR['N (Initial)'] + \
    Broad_Ancestral_NoNR['N (Replication)']
Broad_Ancestral_NoNR['% Discovery'] = (
    Broad_Ancestral_NoNR['N (Initial)'] /
    Broad_Ancestral_NoNR['N (Initial)'].sum()) * 100
Broad_Ancestral_NoNR['% Replication'] = (
    Broad_Ancestral_NoNR['N (Replication)'] /
    Broad_Ancestral_NoNR['N (Replication)'].sum()) * 100
Broad_Ancestral_NoNR['% Total'] = (Broad_Ancestral_NoNR['Total'] /
                                   Broad_Ancestral_NoNR['Total'].sum()) * 100
Broad_Ancestral_NoNR.to_csv(os.path.abspath(
                            os.path.join('__file__', '../..', 'tables',
                                         'Broad_Ancestral_NoNR.csv')))
Broad_Ancestral_NoNR
Out[15]:
N (Initial) N (Replication) Total % Discovery % Replication % Total
Broader
African 364922 143454 508376 0.311830 0.280667 0.302357
African Am./Caribbean 2289950 1002735 3292685 1.956789 1.961849 1.958327
Asian 11607308 9183808 20791116 9.918578 17.968102 12.365532
European 100678946 39199644 139878590 86.031317 76.694027 83.192898
Hispanic/Latin American 1520762 678596 2199358 1.299509 1.327672 1.308070
Other/Mixed 564039 903493 1467532 0.481978 1.767682 0.872816

What do the raw broad ancestral fields look like before we visualise the broader categories? We drop all rows with multiple (comma-separated) entries:

In [16]:
GroupAnc = Cat_Ancestry[(Cat_Ancestry['BROAD ANCESTRAL'] != 'Other') &
                        (Cat_Ancestry['BROAD ANCESTRAL'].str.count(',') == 0)].\
    groupby(['BROAD ANCESTRAL'])['N'].sum().to_frame()
GroupAnc['Individuals (%)'] = (GroupAnc['N'] /
                               GroupAnc['N'].sum()) * 100
GroupAnc.sort_values(by='Individuals (%)', ascending=False)[0:10]
Out[16]:
N Individuals (%)
BROAD ANCESTRAL
European 139878590 78.978183
East Asian 16613415 9.380258
NR 10609771 5.990484
African American or Afro-Caribbean 3292685 1.859114
Hispanic or Latin American 2199358 1.241800
Asian unspecified 2196541 1.240210
South Asian 1688410 0.953309
African unspecified 340542 0.192277
South East Asian 121225 0.068446
Sub-Saharan African 107393 0.060636

Let's now make some bubble plots to visualise the raw broad ancestry field and the fields we extracted from the free-text strings:
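plot_bubbles itself lives in Support/Figures and is not reproduced here; as a rough, self-contained sketch of the general approach (a matplotlib scatter with marker area scaled by sample size), on illustrative totals rather than the Catalogue data:

```python
import matplotlib
matplotlib.use('Agg')  # render off-screen
import matplotlib.pyplot as plt

# Illustrative (not real) totals per broader ancestry, in millions
totals = {'European': 139.9, 'Asian': 20.8, 'African Am./Caribbean': 3.3,
          'Hispanic/Latin American': 2.2, 'Other/Mixed': 1.5, 'African': 0.5}
colours = {'African': '#4daf4a', 'African Am./Caribbean': '#984ea3',
           'Asian': '#e41a1c', 'European': '#377eb8',
           'Hispanic/Latin American': '#ff7f00', 'Other/Mixed': '#ffff33'}

fig, ax = plt.subplots(figsize=(8, 2))
for x, (label, n) in enumerate(sorted(totals.items(), key=lambda kv: -kv[1])):
    ax.scatter(x, 0, s=n * 30, c=colours[label], label=label)  # area ~ N
ax.set_yticks([])
ax.legend(loc='upper right', fontsize='x-small')
fig.savefig('bubble_sketch.png')
```

The real function additionally plots the free-text-derived fields alongside the raw field; the colour dictionary above matches the one passed to plot_bubbles in the next cell.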

In [17]:
countriesdict = {'African': '#4daf4a', 'African Am./Caribbean': '#984ea3',
                 'Asian': '#e41a1c', 'European': '#377eb8',
                 'Hispanic/Latin American': '#ff7f00', 'Other/Mixed': '#ffff33'}
output_path = os.path.abspath(os.path.join('__file__', "../..", "figures", "svg",
                                             "Figure_2.svg"))
plot_bubbles(output_path, Cat_Ancestry, Broad_Ancestral_NoNR, countriesdict)
print('The largest single sample description is ' + str(Cat_Ancestry['N'].max()) + '.')
The largest single sample description is 1030836.

We can also analyze how these aggregates change over time, which was a key feature of Popejoy and Fullerton (2016). We can provide a much more granular account (simply remove '_NoNR' from the cell below to undertake an equivalent analysis including rows which contain some in-part-NR ancestries):

In [18]:
index = [x for x in range(2007, 2019)]
columns = ['European', 'Asian', 'African', 'Hispanic/Latin American',
           'Other/Mixed', 'African Am./Caribbean']
Cat_Ancestry_NoNR = Cat_Ancestry[
    Cat_Ancestry['Broader'] != 'In Part Not Recorded']
Broad_Ancestral_Time_NoNR_PC = pd.DataFrame(index=index, columns=columns)
for year in range(2007, 2019):
    for broaderancestry in Broad_Ancestral_NoNR.index.tolist():
        Broad_Ancestral_Time_NoNR_PC[broaderancestry.strip()][year] =\
            (Cat_Ancestry_NoNR[(
                Cat_Ancestry_NoNR['DATE'].str.contains(str(year))) &
                (Cat_Ancestry_NoNR['Broader'] ==
                 broaderancestry)]['N'].sum() /
             Cat_Ancestry_NoNR[
             (Cat_Ancestry_NoNR['DATE'].str.contains(
                 str(year)))]['N'].sum()) * 100
Broad_Ancestral_Time_NoNR_PC.to_csv(os.path.abspath(
                            os.path.join('__file__', '../..', 'tables',
                                         'Broad_Ancestral_Time_NoNR_PC.csv')))
Broad_Ancestral_Time_NoNR_PC.style
Out[18]:
European Asian African Hispanic/Latin American Other/Mixed African Am./Caribbean
2007 95.4688 2.13984 0.00542792 0.715245 1.18391 0.486807
2008 95.2884 2.94597 0 0.00180063 1.21894 0.544872
2009 88.1708 7.10488 0.258137 0.224117 3.36046 0.881605
2010 86.8535 9.89447 0.266794 0.05829 2.43575 0.491204
2011 78.2316 15.8233 0.15679 0.397565 1.70889 3.68185
2012 71.9806 19.4652 0.312952 0.884389 2.87411 4.48276
2013 82.1979 11.6865 0.394273 0.787581 0.617715 4.31598
2014 76.6056 18.6152 0.250182 1.15204 0.97794 2.39904
2015 87.8072 9.42814 0.283439 0.771371 0.533019 1.17688
2016 90.785 4.64818 0.167738 1.46629 1.09776 1.83503
2017 87.9593 6.33497 0.573358 2.32955 0.672269 2.13053
2018 74.4713 22.4694 0.233238 1.20412 0.198898 1.42301

We can focus on the initial discovery stage to calculate the number of individuals required per ancestry to uncover one hit. Note, however, that this requires some distributional assumptions (i.e. that the same diseases are being studied across ancestries) and assumptions about the overlap in contributions to cohorts.

In [19]:
Cat_Ancestry_Initial = Cat_Ancestry[Cat_Ancestry['STAGE'] == 'initial']

Cat_Ancestry_NoDups = Cat_Ancestry_Initial.drop_duplicates(
    subset=['STUDY ACCESSION'],
    keep=False)[['PUBMEDID', 'STUDY ACCESSION',
                 'Broader','N']]
Cat_Ancestry_NoDups_merge = pd.merge(Cat_Ancestry_NoDups,
                                     Cat_Studies[['ASSOCIATION COUNT',
                                                  'STUDY ACCESSION',
                                                  'MAPPED_TRAIT']],
                                     how='left', on='STUDY ACCESSION')

listtoiterate = ['European', 'Other/Mixed', 'Asian','African',
                 'African Am./Caribbean', 'Hispanic/Latin American']
for ancestry in listtoiterate:
    temp = Cat_Ancestry_NoDups_merge[Cat_Ancestry_NoDups_merge[
        'Broader'].str.strip() == ancestry]
    print('The number of ' + ancestry + 's required to find one hit: ' +
          str(round(1 /
                    (temp['ASSOCIATION COUNT'].sum() / temp['N'].sum()), 1)))
The number of Europeans required to find one hit: 1501.2
The number of Other/Mixeds required to find one hit: 478.1
The number of Asians required to find one hit: 1352.9
The number of Africans required to find one hit: 779.0
The number of African Am./Caribbeans required to find one hit: 436.8
The number of Hispanic/Latin Americans required to find one hit: 393.4
In [20]:
Cat_Ancestry_Initial['Broader'].unique()
Out[20]:
array(['European', 'Other/Mixed', 'Asian', 'In Part Not Recorded',
       'African Am./Caribbean', 'Hispanic/Latin American', 'African'],
      dtype=object)

Choropleth map of country of recruitment

This is a choropleth map of country of recruitment - note that this field is not quite fully populated in the Catalogue. Basemap code is loosely based on this. As we want to study the number of people recruited from each country, we can only utilize values in the 'Country of Recruitment' field when exactly one country is mentioned. For example: if N=100,000 and Country of Recruitment = {U.K., U.S.}, there is no way for us to know from the Catalogue what the breakdown between the two countries is (especially as the free text field may just contain 'European' ancestry). Our only option in this scenario is to drop such observations. We are also faced with multiple entries equal to 'NR', which corresponds to 'not recorded'. This reduces the number of studies going into the map by a further 21%. This has no relationship to whether ancestry is coded, and the make_clean_CoR function summarizes these dropped rows below.
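make_clean_CoR is defined in Support/LoadData; a minimal sketch of the filtering it performs, assuming the relevant column is the Catalogue's 'COUNTRY OF RECRUITMENT' field (the real function also reports what was dropped; clean_cor_sketch and demo are hypothetical names):

```python
import pandas as pd

def clean_cor_sketch(cat_ancestry):
    """Keep only rows with exactly one recorded country of recruitment.

    Hypothetical simplification of make_clean_CoR: drops multi-country
    (comma-separated) entries and 'NR' (not recorded) values.
    """
    cor = cat_ancestry['COUNTRY OF RECRUITMENT'].str.strip()
    mask = (cor != 'NR') & (cor.str.count(',') == 0)
    out = cat_ancestry[mask].copy()
    out['Cleaned Country'] = cor[mask]
    return out

demo = pd.DataFrame({'COUNTRY OF RECRUITMENT': ['U.K.', 'U.K., U.S.',
                                                'NR', 'Japan'],
                     'N': [100, 200, 50, 80]})
print(clean_cor_sketch(demo)['Cleaned Country'].tolist())  # ['U.K.', 'Japan']
```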

In [21]:
Cat_Studies, Cat_Ancestry, Cat_Ancestry_groupedbyN, Cat_Full = load_gwas_cat()
Clean_CoR = make_clean_CoR(Cat_Ancestry)
Cleaned_CoR_N = pd.DataFrame(Clean_CoR.groupby(['Cleaned Country'])['N'].sum())
Cleaned_CoR_N['Rank_N'] = Cleaned_CoR_N.rank(
    ascending=False, method='dense').astype(int)
Cleaned_CoR_count = pd.DataFrame(
    Clean_CoR.groupby(['Cleaned Country'])['N'].count())
Cleaned_CoR_count['Rank_count'] = Cleaned_CoR_count.rank(
    ascending=False, method='dense').astype(int)
Cleaned_CoR_count = Cleaned_CoR_count.rename(columns={'N': 'Count'})
Cleaned_CoR = pd.merge(Cleaned_CoR_count, Cleaned_CoR_N,
                       left_index=True, right_index=True)
Cleaned_CoR.index.name = None
Cleaned_CoR['Count (%)'] = round((Cleaned_CoR['Count'] /
                                   Cleaned_CoR['Count'].sum()) * 100,2)
Cleaned_CoR['N (%)'] = round((Cleaned_CoR['N'] /
                               Cleaned_CoR['N'].sum()) * 100,2)
countrylookup = pd.read_csv(os.path.abspath(
                            os.path.join('__file__',
                                         '../..',
                                         'data',
                                         'ShapeFiles',
                                         'Country_Lookup.csv')),
                            index_col='Country')
country_merged = pd.merge(countrylookup,
                          Cleaned_CoR,
                          left_index=True,
                          right_index=True)
country_merged['Per Rec'] = round(country_merged['N']/pd.to_numeric(country_merged['2017population']),2)
country_merged.sort_values('Rank_N', ascending=True)[
    ['Continent', 'Count', 'N', 'Count (%)', 'N (%)', 'Per Rec']].to_csv(
    os.path.abspath(os.path.join('__file__',
                                 '../..',
                                 'tables',
                                 'CountryOfRecruitment.csv')))
Cleaning for single country of recruitment:
54.86% of the rows remain.
30.73% of the N remains.

Let's visualise this table, where the 'Per Rec' column is the population-adjusted metric (N recruited per head of 2017 population):

In [22]:
country_merged.sort_values('Rank_N', ascending=True)[
    ['Continent', 'Count', 'N', 'Count (%)', 'N (%)', 'Per Rec']].head(10)
Out[22]:
Continent Count N Count (%) N (%) Per Rec
United Kingdom Europe 662 22521698 10.54 40.50 0.34
United States North America 2576 10997635 41.01 19.78 0.03
Japan Asia 481 7940622 7.66 14.28 0.06
Iceland Europe 71 6409109 1.13 11.52 19.13
China Asia 500 2059693 7.96 3.70 0.00
Finland Europe 218 1193333 3.47 2.15 0.22
Korea, South Asia 256 857072 4.08 1.54 0.02
Netherlands Europe 175 663477 2.79 1.19 0.04
Germany Europe 175 434824 2.79 0.78 0.01
Australia Oceania 110 320458 1.75 0.58 0.01
In [23]:
output_path = os.path.abspath(os.path.join('__file__',
                                           '../..',
                                           'figures',
                                           'svg',
                                           'Figure_3.svg'))
choropleth_map(country_merged, 'N', 'OrRd', output_path)

Let's instead colour the map by the 'Per Rec' field:

In [24]:
output_path = os.path.abspath(os.path.join('__file__',
                                           '../..',
                                           'figures',
                                           'svg',
                                           'Figure_3_Blues.svg'))
choropleth_map(country_merged, 'Per Rec', 'Blues', output_path)

Let's now aggregate that to the continent level via a country-to-continent lookup file, using data from within the shapefile itself (based on the CIA World Factbook):

In [25]:
country_merged_sumcount = country_merged[[
    'Continent', 'Count']].groupby(['Continent']).sum()
country_merged_sumN = country_merged[[
    'Continent', 'N']].groupby(['Continent']).sum()
country_merged_sums = pd.merge(
    country_merged_sumN, country_merged_sumcount,
    left_index=True, right_index=True)
country_merged_sums['N (%)'] = (
    country_merged_sums['N'] / country_merged_sums['N'].sum()) * 100
country_merged_sums['Count (%)'] = (
    country_merged_sums['Count'] / country_merged_sums['Count'].sum()) * 100
country_merged_sums.to_csv(os.path.abspath(
                           os.path.join('__file__',
                                        '../..',
                                        'tables',
                                        'ContinentOfRecruitment.csv')))
country_merged_sums
Out[25]:
N Count N (%) Count (%)
Continent
Africa 63616 62 0.114391 0.987104
Asia 11511832 1569 20.699918 24.980099
Europe 32560802 1827 58.548972 29.087725
North America 11117941 2674 19.991646 42.572839
Oceania 332398 119 0.597699 1.894603
Seven seas (open ocean) 1389 1 0.002498 0.015921
South America 24957 29 0.044876 0.461710

Who funds GWAS and what do they fund?

A simple descriptive tabulation

Let's do a super simple tabulation of the top 20 GWAS funders (measured by 'Grant Contributions': i.e. a funder funding an author's time on a paper):

In [26]:
FunderInfo, AbstractInfo, AuthorMaster = load_pubmed_data()

AllFunders = FunderInfo.groupby(by='Agency').count()
AllFunders.index.name = None
AllFunders = AllFunders.reset_index()
AllFunders = AllFunders.rename(columns={
                               'index': 'Agency',
                               'PUBMEDID': 'Grant Contributions',
                               'GrantCountry': 'Country'})
AllFunders_withcountries = pd.merge(AllFunders[['Agency',
                                                'Grant Contributions']],
                                    FunderInfo[['Agency',
                                                'GrantCountry']].drop_duplicates(
                                        'Agency'),
                                    on='Agency', how='left')
AllFunders_withcountries = AllFunders_withcountries.set_index('Agency')
AllFunders_withcountries.index.name = None
AllFunders_withcountries['% of Total'] = round((
    AllFunders_withcountries['Grant Contributions'] /
    AllFunders_withcountries['Grant Contributions'].sum()) * 100, 2)
AllFunders_withcountries.sort_values(
    'Grant Contributions', ascending=False)[0:20]
Out[26]:
Grant Contributions GrantCountry % of Total
NHLBI 12731 United States 25.88
NCI 5234 United States 10.64
NIA NIH HHS 4117 United States 8.37
MRC 3548 United Kingdom 7.21
NIDDK NIH HHS 2704 United States 5.50
NIMH 2671 United States 5.43
Wellcome Trust 1835 United Kingdom 3.73
NHGRI 1811 United States 3.68
NCRR NIH HHS 1571 United States 3.19
NIAID NIH HHS 1190 United States 2.42
PHS HHS 1141 United States 2.32
NIAMS NIH HHS 1061 United States 2.16
NIAAA NIH HHS 909 United States 1.85
WHI NIH HHS 780 United States 1.59
NINDS 775 United States 1.58
NIDA 702 United States 1.43
NCATS NIH HHS 700 United States 1.42
NIGMS NIH HHS 609 United States 1.24
Cancer Research UK 604 United Kingdom 1.23
Intramural NIH HHS 587 United States 1.19

Let's print out some simple descriptives from these data:

In [27]:
print('There are ' + str(len(FunderInfo['Agency'].drop_duplicates())) +
      ' unique funders returned from PubMed Central.')
print('There are ' + str(len(FunderInfo['GrantID'].drop_duplicates())) +
      ' unique grants returned from PubMed Central.')
print('There are ' + str(len(FunderInfo['GrantCountry'].drop_duplicates())) +
      ' unique grant countries returned from PubMed Central.')
print('Each study has an average of ' +
      str(round(len(FunderInfo) / len(id_list), 2)) + ' grants funding it.')
grantgroupby = FunderInfo.groupby(['Agency', 'GrantID']).size().groupby(
    level=1).max().sort_values(ascending=False).reset_index()
print('The most frequently acknowledged grant is GrantID ' +
      grantgroupby['GrantID'][0] + '.\nThis grant is acknowledged ' +
      str(grantgroupby[0][0]) + ' times.')
There are 138 unique funders returned from PubMed Central.
There are 12790 unique grants returned from PubMed Central.
There are 8 unique grant countries returned from PubMed Central.
Each study has an average of 13.52 grants funding it.
The most frequently acknowledged grant is GrantID P30 DK063491.
This grant is acknowledged 207 times.

Save the unique agencies and grant IDs so that they can be manually eyeballed and cleaned for duplicates:

In [28]:
FunderInfo['Agency'].drop_duplicates().to_csv(os.path.abspath(
                                              os.path.join('__file__', 
                                                           '../..', 'data', 'PUBMED',
                                                           'Pubmed_Unique_Agencies.csv')),
                                             index=False)
FunderInfo['GrantID'].drop_duplicates().to_csv(os.path.abspath(
                                              os.path.join('__file__', 
                                                           '../..', 'data', 'PUBMED',
                                                           'Pubmed_Unique_GrantID.csv')),
                                              index=False)

International distribution of funders

From which countries do these grants come?

In [29]:
TopCountryFunders = FunderInfo.groupby(by='GrantCountry').count()
TopCountryFunders = TopCountryFunders.rename(
    columns={'PUBMEDID': 'Number Of Studies'})
TopCountryFunders = TopCountryFunders.sort_values(
    'Number Of Studies', ascending=False)[['Number Of Studies']]
TopCountryFunders = TopCountryFunders.reset_index().rename(
    columns={'GrantCountry': 'Country of Funder'})
TopCountryFunders['Percent Funded (%)'] = (
    TopCountryFunders['Number Of Studies'] /
    TopCountryFunders['Number Of Studies'].sum()) * 100
TopCountryFunders = TopCountryFunders.set_index('Country of Funder')
TopCountryFunders.index.name = None
TopCountryFunders.head(10)
Out[29]:
Number Of Studies Percent Funded (%)
United States 41739 85.113889
United Kingdom 7045 14.366117
Canada 177 0.360937
International 65 0.132548
Austria 7 0.014274
Italy 5 0.010196
Switzerland 1 0.002039

What about the commas?

Let's first get a metric of the effect of splitting on the commas in the EFO field. We can make identical heatmaps by just replacing EFO_Parent_Mapped with EFO_Parent_Mapped_NoCommas. Splitting inherently assumes that a GWAS which mentions two EFO terms contributes fully to each of them: as much as a GWAS with one EFO term (i.e. no comma to split on) contributes to that one specific term.
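The splitting itself happens inside EFO_parent_mapper (Support/LoadData); a minimal sketch of the idea on hypothetical data, using pandas' str.split followed by explode so that each term receives its own full row:

```python
import pandas as pd

# Hypothetical studies, one with a single EFO term and one with two
studies = pd.DataFrame({'PUBMEDID': [1, 2],
                        'EFO term': ['body mass index',
                                     'asthma, body mass index']})

# Split on the commas and 'explode' so each term gets its own row;
# the two-term study now contributes one full row to each of its terms
split = studies.assign(
    **{'EFO term': studies['EFO term'].str.split(', ')}).explode('EFO term')
print(len(studies), len(split))  # 2 3
```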

In [30]:
Cat_Studies, Cat_Ancestry, Cat_Ancestry_groupedbyN, Cat_Full = load_gwas_cat()
EFO_Parent_Mapped, EFO_Parent_Mapped_NoCommas = EFO_parent_mapper(
    Cat_Studies, Cat_Ancestry_groupedbyN)
print('There are ' + str(len(EFO_Parent_Mapped)) +
      ' rows of EFO terms after we split for commas.')
print('This indicates ' + str(len(EFO_Parent_Mapped) -
                              len(EFO_Parent_Mapped_NoCommas)) +
      ' additional terms mapped compared to when we drop'
      ' comma-separated entries.')
print('This indicates ' +
      str(len(EFO_Parent_Mapped['EFO term'].drop_duplicates())) +
      ' unique EFO terms to map to Parents.')
print('This is in comparison to ' +
      str(len(EFO_Parent_Mapped_NoCommas['EFO term'].drop_duplicates())) +
      ' unique EFO terms in Cat_Studies.')
There are 8053 rows of EFO terms after we split for commas.
This indicates 3678 additional terms mapped compared to when we drop comma-separated entries.
This indicates 1978 unique EFO terms to map to Parents.
This is in comparison to 1403 unique EFO terms in Cat_Studies.

What do they study?

In [31]:
mapped_trait_summary(EFO_Parent_Mapped, 'EFO term')
mapped_trait_summary(EFO_Parent_Mapped_NoCommas, 'EFO term')
                             Sample   Studies  Associations
EFO term                                                   
Type II Diabetes Mellitus  1.663514  1.242082      0.671449
Schizophrenia              0.879867  1.204819      1.151546
Body Mass Index            3.737189  1.179978      2.575223
Breast Carcinoma           0.937038  0.956403      0.732750
Unipolar Depression        2.443244  0.956403      0.465202
Bipolar Disorder           0.350852  0.943982      0.323120
Alzheimers Disease         0.644987  0.919140      0.459472
HD LC measurement          0.769400  0.881878      1.077641
Triglyceride Meas.         0.750467  0.757670      1.033527
Colorectal Cancer          0.625566  0.745249      0.281871
Asthma                     0.315499  0.732828      0.315100
LD LC measurement          0.681996  0.707987      0.938424
Prostate Carcinoma         0.671055  0.695566      0.380411
Rheumatoid Arthritis       0.599189  0.583778      0.328849
Bone Density               0.279895  0.571358      0.882852
Coronary Heart Disease     0.483924  0.558937      0.154112
Systolic Blood Pressure    3.068725  0.546516      1.118317
                             Sample   Studies  Associations
EFO term                                                   
Type II Diabetes Mellitus  2.629278  1.463526      1.207744
Alzheimers Disease         0.590949  1.211983      0.411020
Body Mass Index            3.278644  1.097645      2.705882
Breast Carcinoma           1.243193  1.051909      1.727476
HD LC measurement          1.046499  1.006174      0.917349
Prostate Carcinoma         0.949307  0.960439      0.740134
Schizophrenia              0.713282  0.914704      1.688757
Colorectal Cancer          0.770211  0.914704      0.513775
Bone Density               0.431552  0.891836      0.845867
LD LC measurement          0.942622  0.868969      0.759494
Triglyceride Meas.         0.998113  0.754631      0.646314
Asthma                     0.459700  0.731763      0.381236
Bipolar Disorder           0.321246  0.708895      0.281459
Unipolar Depression        1.856082  0.686028      0.662695
Body Height                1.216159  0.640293      1.222636
Rheumatoid Arthritis       0.429307  0.640293      0.720774
Parkinson'S Disease        0.867314  0.594558      0.349963
In [32]:
mapped_trait_summary(EFO_Parent_Mapped, 'Parent term')
mapped_trait_summary(EFO_Parent_Mapped_NoCommas, 'Parent term')
                              Sample    Studies  Associations
Parent term                                                  
Other Meas.                31.939829  30.244690     40.862112
Neurological Disorder       8.955590  10.557695      4.967688
Other Disease               5.044254   7.676065      3.740518
Cancer                      6.495793   6.682400      3.168183
Response To Drug            1.905659   6.297354      3.858537
Lipid/Lipoprotein Meas.     3.292849   4.918644      5.884341
Other Trait                 3.740056   4.844119      2.241217
Digestive System Disorder   3.222564   4.297603      2.313404
Cardiovascular Disease      7.342520   4.123711      1.868827
Biological Process          5.206341   3.701404      5.248986
Immune System Disorder      2.705725   3.614458      2.245801
Cardiovascular Meas.        3.583971   3.353621      3.271306
Body Meas.                  8.399504   3.142467      9.333249
Hematological Meas.         4.429234   2.720159      5.155029
Metabolic Disorder          2.437143   2.484163      0.944726
Inflammatory Meas.          0.854492   0.894299      3.493595
Liver Enzyme Meas.          0.444476   0.447149      1.402480
                              Sample    Studies  Associations
Parent term                                                  
Other Meas.                28.040414  29.705008     34.609084
Neurological Disorder       8.642945  11.273725      7.182427
Other Disease               5.947303   8.323805      5.812360
Cancer                      8.058808   7.935056      5.731943
Lipid/Lipoprotein Meas.     4.231602   4.847930      4.162323
Cardiovascular Disease      9.429547   4.710725      3.466865
Other Trait                 3.757690   4.573519      2.878630
Digestive System Disorder   2.767986   4.299108      3.836188
Immune System Disorder      2.545207   4.299108      3.800447
Cardiovascular Meas.        4.279070   3.750286      2.512286
Hematological Meas.         6.467463   3.567345     11.350707
Biological Process          3.863894   3.475875      5.374535
Body Meas.                  7.454093   3.132861      5.341772
Metabolic Disorder          3.033741   2.423965      1.655994
Response To Drug            0.118138   2.401098      1.051378
Inflammatory Meas.          0.715173   0.846101      0.878630
Liver Enzyme Meas.          0.646923   0.434484      0.354430

Who's funding what?

In [33]:
FunderInfo, AbstractInfo, AuthorMaster = load_pubmed_data()
Cat_Studies, Cat_Ancestry, Cat_Ancestry_groupedbyN, Cat_Full = load_gwas_cat()
EFO_Parent_Mapped, EFO_Parent_Mapped_NoCommas = EFO_parent_mapper(
    Cat_Studies, Cat_Ancestry_groupedbyN)

EFO_Parent_Mapped['Parent term'] = \
    EFO_Parent_Mapped['Parent term'].str.replace('Disorder', '')

FunderInfo_Parent = pd.merge(FunderInfo, EFO_Parent_Mapped,
                             left_on='PUBMEDID', right_on='PUBMEDID',
                             how='left')

funder_ancestry, funder_parent = make_funders(FunderInfo_Parent, FunderInfo, Cat_Ancestry)
funder_parent.to_csv(os.path.abspath(os.path.join('__file__',
                                           '../..',
                                           'tables',
                                           'Funders_and_Ancestry.csv')))
funder_ancestry.to_csv(os.path.abspath(os.path.join('__file__',
                                           '../..',
                                           'tables',
                                           'Funders_and_BroadEFO.csv')))
output_path = os.path.abspath(os.path.join('__file__',
                                           '../..',
                                           'figures',
                                           'svg',
                                           'Figure_4.svg'))
plot_heatmap(funder_ancestry, funder_parent, output_path)

Let's do the same thing, but this time without splitting on the EFO commas:

In [34]:
FunderInfo, AbstractInfo, AuthorMaster = load_pubmed_data()
Cat_Studies, Cat_Ancestry, Cat_Ancestry_groupedbyN, Cat_Full = load_gwas_cat()
EFO_Parent_Mapped, EFO_Parent_Mapped_NoCommas = EFO_parent_mapper(
    Cat_Studies, Cat_Ancestry_groupedbyN)

EFO_Parent_Mapped_NoCommas['Parent term'] = \
    EFO_Parent_Mapped_NoCommas['Parent term'].str.replace('Disorder', '')

FunderInfo_Parent = pd.merge(FunderInfo, EFO_Parent_Mapped_NoCommas,
                             left_on='PUBMEDID', right_on='PUBMEDID',
                             how='left')

funder_ancestry, funder_parent = make_funders(FunderInfo_Parent, FunderInfo, Cat_Ancestry)
output_path = os.path.abspath(os.path.join('__file__', '../..', 'figures',
                                         'svg', 'Figure_4_NoCommas.svg'))
plot_heatmap(funder_ancestry, funder_parent, output_path)

Who are the Authors?

Who are the most cited authors? What is their 'GWAS H-Index', calculated based only on their papers in the GWAS Catalogue? This assumes unique forename + surname combinations, and that the same author is recorded consistently across studies. First, a couple of snippets:

In [35]:
FunderInfo, AbstractInfo, AuthorMaster = load_pubmed_data()
print('There are a total of ' + str(len(AuthorMaster)) +
      ' "authorship contributions".')
print('These contributions are made by ' +
      str(len(AuthorMaster['AUTHORNAME'].unique())) + ' unique authors.')
print('There are a total of ' +
      str(len(AuthorMaster) /
          len(AuthorMaster['PUBMEDID'].drop_duplicates())) +
      ' "authorship contributions per paper".')
print('The study with the most number of authors has ' +
      str(AuthorMaster.groupby(['PUBMEDID']).size().max()) +
      ' authors (PubMed ID: ' +
      str(AuthorMaster.groupby(['PUBMEDID']).size().idxmax()) + ')')
There are a total of 122141 "authorship contributions".
These contributions are made by 39893 unique authors.
There are a total of 33.71266905879106 "authorship contributions per paper".
The study with the most number of authors has 559 authors (PubMed ID: 22479202)

Calculate 'GWAS-H' indexes

Let's then calculate our GWAS-H indexes using citation data and paper counts for unique authors.
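The GWAS-H index follows the standard H-index definition: the largest h such that an author has h papers with at least h citations each. A compact sketch of that definition (gwas_h_index is a hypothetical helper; the cell below computes the same quantity by looping over the merged citation data):

```python
def gwas_h_index(citation_counts):
    """Largest h such that h papers each have at least h citations."""
    h = 0
    for rank, cites in enumerate(sorted(citation_counts, reverse=True),
                                 start=1):
        if cites >= rank:
            h = rank  # still have `rank` papers with >= `rank` citations
        else:
            break
    return h

print(gwas_h_index([10, 8, 5, 4, 3]))  # 4
```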

In [36]:
CitationCounts = pd.read_csv(os.path.abspath(
    os.path.join('__file__', "../..",
                 "data", "PUBMED", 'Pubmed_Cites.csv')))
AuthorMaster_withcites = pd.merge(
    AuthorMaster, CitationCounts, on=['PUBMEDID'], how='left')
allauthors_formerge = pd.DataFrame(
    AuthorMaster_withcites[['AUTHORNAME', 'citedByCount']])
allauthors_papercount = pd.DataFrame(
    allauthors_formerge['AUTHORNAME'].value_counts())
allauthors_citecount = pd.DataFrame(
    allauthors_formerge.groupby(by='AUTHORNAME')['citedByCount'].sum())
allauthors_merged = pd.merge(
    allauthors_papercount, allauthors_citecount, left_index=True,
    right_index=True)
allauthors_merged.columns = ['Papers', 'citedByCount']
allauthors_merged = allauthors_merged.sort_values(
    by='citedByCount', ascending=False)
allauthors_merged['GWAS-H'] = 0
counter = 0
for author in allauthors_merged.index:
    counter += 1
    temp = AuthorMaster_withcites[AuthorMaster_withcites['AUTHORNAME'] ==
                                  author].sort_values(by='citedByCount',
                                                      ascending=False).dropna()
    temp = temp.reset_index(drop=True)
    for pubnumber in range(0, len(temp)):
        if pubnumber + 1 > temp['citedByCount'][pubnumber]:
            break
        else:
            allauthors_merged.loc[author, ('GWAS-H')] = int(pubnumber+1)
    sys.stdout.write('\r' + 'Calculating GWAS H-indices: finished ' +
                     str(counter) + ' of ' +
                     str(len(allauthors_merged.index)) + ' authors...')
allauthors_citecount.reset_index(inplace=True)
allauthors_papercount.reset_index(inplace=True)
allauthors_papercount.rename(
    columns={'AUTHORNAME': 'PAPERCOUNT', 'index': 'AUTHORNAME', },
    inplace=True)
Calculating GWAS H-indices: finished 39893 of 39893 authors...

Compare Author Metrics Between Genders

Calculate author centralities

Let's next calculate some measures of authorship centrality with NetworkX. Note: this can take some time to calculate. To get it to run faster, raise the citedByCount threshold to something like 100 to filter for authors with a minimum of one hundred citations (or tighten the PAPERCOUNT > 1 filter):

In [37]:
AuthorMaster_sumcites = pd.merge(AuthorMaster, allauthors_citecount,
                                 left_on='AUTHORNAME', right_on='AUTHORNAME',
                                 how='left')
AuthorMaster_sumcitespapercount = pd.merge(AuthorMaster_sumcites,
                                           allauthors_papercount,
                                           left_on='AUTHORNAME',
                                           right_on='AUTHORNAME', how='left')
AuthorMaster_sumcitespapercount_filter_cites = AuthorMaster_sumcitespapercount[
    AuthorMaster_sumcitespapercount['citedByCount'] > 10]
AuthorMaster_sumcitespapercount_filtered =\
    AuthorMaster_sumcitespapercount_filter_cites[
        AuthorMaster_sumcitespapercount_filter_cites['PAPERCOUNT'] > 1]

G = nx.Graph()
counter = 0
alledges = []
for paper in AuthorMaster_sumcitespapercount_filtered['PUBMEDID'].unique():
    counter += 1
    temp = AuthorMaster_sumcitespapercount_filtered[
        AuthorMaster_sumcitespapercount_filtered['PUBMEDID'] == paper]
    if len(temp) > 1:
        templist = list(itertools.combinations(temp.AUTHORNAME, 2))
        for edge in templist:
            alledges.append(edge)
G.add_edges_from(list(set(alledges)))

print('This gives us a network with ' + str(len(G)) +
      ' nodes.\nEach node represents a unique author with more than'
      ' 1 paper and more than 10 cites.')

betcent = pd.Series(nx.betweenness_centrality(G), name='Betweenness')
allauthors_merged = allauthors_merged.merge(
    betcent.to_frame(), left_index=True, right_index=True)

degcent = pd.Series(nx.degree_centrality(G), name='Degree')
allauthors_merged = allauthors_merged.merge(
    degcent.to_frame(), left_index=True, right_index=True)
This gives us a network with 15094 nodes.
Each node represents a unique author with more than 1 paper and more than 10 cites.

We can then merge all of this with data collected manually from web searches (country of employment, current employer, etc.). We then rank authors in three different ways to analyze the overlap between the three metrics.
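One simple way to quantify the overlap between the rankings is a pairwise Spearman rank correlation; a sketch using the top four rows of the merged author table for illustration (the exact values are incidental):

```python
import pandas as pd

# Four authors' metrics, taken from the head of the merged table
metrics = pd.DataFrame({'Papers': [177, 142, 267, 109],
                        'citedByCount': [27568, 23633, 25534, 20323],
                        'GWAS-H': [84, 77, 76, 68]})

# Pairwise Spearman correlations between the three rankings
print(metrics.corr(method='spearman'))
```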

In [38]:
authorsupplemental = pd.read_csv(os.path.abspath(
    os.path.join('__file__', '../..',
                 'data', 'Support', 'Author_Supplmentary.csv')),
     encoding='latin-1', index_col=0)
allauthors_merged_withsupp = pd.merge(allauthors_merged,
                                      authorsupplemental,
                                      left_index=True, right_index=True,
                                      how='left')
allauthors_merged_withsupp['Betweenness'] = round(allauthors_merged_withsupp['Betweenness'],3)
allauthors_merged_withsupp['Degree'] = round(allauthors_merged_withsupp['Degree'],3)
allauthors_merged_withsupp['CitePercentile'] = round(allauthors_merged_withsupp.citedByCount.rank(pct=True),4)
allauthors_merged_withsupp[['Papers','citedByCount',
                           'GWAS-H','Betweenness',
                           'Degree','Country',
                           'Institution',
                           'CitePercentile']].sort_values(by='GWAS-H',
                                                         ascending=False).to_csv(os.path.abspath(
                                                                                 os.path.join('__file__',
                                                                                              '../..',
                                                                                              'tables',
                                                                                              'Authors.csv')))
allauthors_merged_withsupp.sort_values(by='GWAS-H', ascending=False).head(10)
Out[38]:
Papers citedByCount GWAS-H Betweenness Degree Country Institution CitePercentile
Kari Stefansson 177 27568 84 0.020 0.308 Iceland deCode Genetics 1.0000
Unnur Thorsteinsdottir 142 23633 77 0.006 0.241 Iceland deCode Genetics 0.9999
Albert Hofman 267 25534 76 0.013 0.345 U.S. Harvard University 0.9999
Andre G Uitterlinden 280 23337 76 0.018 0.367 Netherlands Erasmus MC 0.9998
Cornelia M van Duijn 188 20879 71 0.008 0.294 Netherlands UMC Rotterdam 0.9997
Gudmar Thorleifsson 119 20408 70 0.006 0.232 Iceland deCode Genetics 0.9995
Christian Gieger 166 22562 70 0.011 0.272 Germany Helmholtz-Muenchen 0.9997
Panos Deloukas 109 20323 68 0.009 0.233 U.K. Queen Mary UoL 0.9995
H-Erich Wichmann 112 20266 68 0.007 0.220 Germany Helmholtz-Muenchen 0.9994
Fernando Rivadeneira 198 17976 65 0.009 0.282 Netherlands UMC Rotterdam 0.9992
In [39]:
allauthors_merged_withsupp.sort_values(by='Degree',
                                       ascending=False).head(10)
Out[39]:
Papers citedByCount GWAS-H Betweenness Degree Country Institution CitePercentile
Andre G Uitterlinden 280 23337 76 0.018 0.367 Netherlands Erasmus MC 0.9998
Albert Hofman 267 25534 76 0.013 0.345 U.S. Harvard University 0.9999
Kari Stefansson 177 27568 84 0.020 0.308 Iceland deCode Genetics 1.0000
Cornelia M van Duijn 188 20879 71 0.008 0.294 Netherlands UMC Rotterdam 0.9997
Jerome I Rotter 188 17828 59 0.011 0.282 U.S. UCLA 0.9991
Fernando Rivadeneira 198 17976 65 0.009 0.282 Netherlands UMC Rotterdam 0.9992
Christian Gieger 166 22562 70 0.011 0.272 Germany Helmholtz-Muenchen 0.9997
Bruce M Psaty 168 12966 58 0.005 0.262 U.S. University of Washington 0.9965
Nicholas G Martin 164 13570 58 0.010 0.260 Australia QIMR Berghofer 0.9973
Vilmundur Gudnason 128 15375 59 0.003 0.258 Iceland I.H.A. 0.9983
In [40]:
allauthors_merged_withsupp.sort_values(by='citedByCount',
                                       ascending=False).head(10)
Out[40]:
Papers citedByCount GWAS-H Betweenness Degree Country Institution CitePercentile
Kari Stefansson 177 27568 84 0.020 0.308 Iceland deCode Genetics 1.0000
Albert Hofman 267 25534 76 0.013 0.345 U.S. Harvard University 0.9999
Unnur Thorsteinsdottir 142 23633 77 0.006 0.241 Iceland deCode Genetics 0.9999
Andre G Uitterlinden 280 23337 76 0.018 0.367 Netherlands Erasmus MC 0.9998
Christian Gieger 166 22562 70 0.011 0.272 Germany Helmholtz-Muenchen 0.9997
Cornelia M van Duijn 188 20879 71 0.008 0.294 Netherlands UMC Rotterdam 0.9997
Mark I McCarthy 99 20681 62 0.003 0.188 U.K. University of Oxford 0.9996
Gudmar Thorleifsson 119 20408 70 0.006 0.232 Iceland deCode Genetics 0.9995
Panos Deloukas 109 20323 68 0.009 0.233 U.K. Queen Mary UoL 0.9995
H-Erich Wichmann 112 20266 68 0.007 0.220 Germany Helmholtz-Muenchen 0.9994

And then make a simple correlation matrix to check how highly related these metrics are:

In [41]:
allauthors_merged_withsupp[['citedByCount', 'GWAS-H',
                            'Betweenness', 'Degree', 'Papers']].corr()
Out[41]:
citedByCount GWAS-H Betweenness Degree Papers
citedByCount 1.000000 0.894947 0.512354 0.818765 0.846734
GWAS-H 0.894947 1.000000 0.533297 0.881147 0.943116
Betweenness 0.512354 0.533297 1.000000 0.484133 0.650234
Degree 0.818765 0.881147 0.484133 1.000000 0.851937
Papers 0.846734 0.943116 0.650234 0.851937 1.000000
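As a reminder, `DataFrame.corr` computes pairwise Pearson correlations by default; a minimal illustration on toy columns:

```python
import pandas as pd

# 'y' is perfectly correlated with 'x'; 'z' is perfectly anti-correlated
df = pd.DataFrame({'x': [1, 2, 3], 'y': [2, 4, 6], 'z': [3, 2, 1]})
corr = df.corr()
```

`corr.loc['x', 'y']` is 1.0 and `corr.loc['x', 'z']` is -1.0.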
In [42]:
print('There are a total of ' +
      str(len(allauthors_merged_withsupp)) + ' authors in the table.')
print('The person with the highest GWAS-H Index is: ' +
      allauthors_merged_withsupp['GWAS-H'].idxmax())
print('The person with the highest Degree is: ' +
      allauthors_merged_withsupp['Degree'].idxmax())
print('The person with the highest citedByCount is: ' +
      allauthors_merged_withsupp['citedByCount'].idxmax())
print('The person with the most Papers is: ' +
      allauthors_merged_withsupp['Papers'].idxmax())
There are a total of 15095 authors in the table.
The person with the highest GWAS-H Index is: Kari Stefansson
The person with the highest Degree is: Andre G Uitterlinden
The person with the highest citedByCount is: Kari Stefansson
The person with the most Papers is: Andre G Uitterlinden
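The `CitePercentile` column and the 'highest' lookups above rest on two pandas idioms, `rank(pct=True)` and `idxmax`; a small sketch with invented citation counts:

```python
import pandas as pd

cites = pd.Series({'Alice': 27568, 'Bob': 950, 'Carol': 12},
                  name='citedByCount')

pct = cites.rank(pct=True).round(4)  # percentile rank of each author
top = cites.idxmax()                 # index label holding the maximum
```

`top` is 'Alice', and `pct['Carol']` is 0.3333 (the lowest of three ranks).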

Author gender

Let's consider the gender of each author by analyzing their forenames using the gender_guesser library, which is based on gender.c by Jörg Michael. We can directly compare our results to this wonderful paper. One key assumption made here is to combine all 'mostly_male' and 'mostly_female' names into the respective male/female categories.
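gender_guesser's detector returns one of six labels per forename ('male', 'female', 'mostly_male', 'mostly_female', 'andy', 'unknown'); the collapsing described above is a plain string replacement over those labels. A minimal sketch, with hand-written labels standing in for detector output:

```python
import pandas as pd

# labels as they would come back from gender_guesser's get_gender()
labels = pd.Series(['male', 'mostly_female', 'andy', 'unknown', 'female'])

# fold 'mostly_male'/'mostly_female' into 'male'/'female'
collapsed = labels.str.replace('mostly_', '', regex=False)

# binary flag that is later averaged per trait to get female shares
isfemale = (collapsed == 'female').astype(int)
```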

In [43]:
gendet = gender.Detector()
FunderInfo, AbstractInfo, AuthorMaster = load_pubmed_data()
Cat_Studies, Cat_Ancestry, Cat_Ancestry_groupedbyN, Cat_Full = load_gwas_cat()

AuthorCounts = AuthorMaster.groupby(['PUBMEDID'])['AUTHORNAME'].count(
).to_frame().reset_index().rename(columns={"AUTHORNAME": "Author Count"})
AuthorMaster = pd.merge(AuthorMaster, AuthorCounts, how='left', on='PUBMEDID')
AuthorMaster['CleanForename'] = AuthorMaster['FORENAME'].map(
    lambda x: clean_names(x))
AuthorMaster['CleanGender'] = AuthorMaster['CleanForename'].map(
    lambda x: gendet.get_gender(x))
AuthorMaster['MaleFemale'] = AuthorMaster['CleanGender'].str.replace('mostly_',
                                                                     '')
AuthorMaster['isfemale'] = np.where(
    AuthorMaster['MaleFemale'] == 'female', 1, 0)
AuthorFirst = AuthorMaster[AuthorMaster['Author Count']
                           > 4].drop_duplicates(subset='PUBMEDID', keep='first')
AuthorLast = AuthorMaster[AuthorMaster['Author Count']
                          > 4].drop_duplicates(subset='PUBMEDID', keep='last')
AuthorUnique = AuthorMaster.drop_duplicates(subset='AUTHORNAME')

for gender_ in AuthorMaster['CleanGender'].unique():
    print(str(round((len(AuthorMaster[AuthorMaster['CleanGender'] == gender_]) /
                     len(AuthorMaster['CleanGender'])) * 100, 2)) + '%' +
          ' ' + gender_ + ' authorship contributions in total')
    print(str(round((len(AuthorUnique[AuthorUnique['CleanGender'] == gender_]) /
                     len(AuthorUnique['CleanGender'])) * 100, 2)) + '%' +
          ' ' + gender_ + ' unique authors in total')
    print(str(round((len(AuthorFirst[AuthorFirst['CleanGender'] == gender_]) /
                     len(AuthorFirst['CleanGender'])) * 100, 2)) + '%' +
          ' ' + gender_ + ' first authors in total')
    print(str(round((len(AuthorLast[AuthorLast['CleanGender'] == gender_]) /
                     len(AuthorLast['CleanGender'])) * 100, 2)) + '%' +
          ' ' + gender_ + ' last authors in total')

print('\nPercent of male author contributions: ' +
      str(round(len(AuthorMaster[AuthorMaster['MaleFemale'] == 'male']) /
                (len(AuthorMaster[(AuthorMaster['MaleFemale'] == 'male') |
                                  (AuthorMaster['MaleFemale'] == 'female')])) *
                100, 2)) + '%')

print('Percent of unique male authors: ' +
      str(round(len(AuthorUnique[AuthorUnique['MaleFemale'] == 'male']) /
                (len(AuthorUnique[(AuthorUnique['MaleFemale'] == 'male') |
                                  (AuthorUnique['MaleFemale'] == 'female')])) *
                100, 2)) + '%')

print('Percent of male first authors: ' +
      str(round(len(AuthorFirst[AuthorFirst['MaleFemale'] == 'male']) /
                (len(AuthorFirst[(AuthorFirst['MaleFemale'] == 'male') |
                                 (AuthorFirst['MaleFemale'] == 'female')])) *
                100, 2)) + '%')

print('Percent of male last authors: ' +
      str(round(len(AuthorLast[AuthorLast['MaleFemale'] == 'male']) /
                (len(AuthorLast[(AuthorLast['MaleFemale'] == 'male') |
                                (AuthorLast['MaleFemale'] == 'female')])) *
                100, 2)) + '%')

AuthorMaster_filtered = AuthorMaster[(AuthorMaster['MaleFemale'].str.contains(
    'male')) & (AuthorMaster['Author Count'] > 4)]
AuthorMaster_filtered_merged_bydisease = pd.merge(
    AuthorMaster_filtered, Cat_Studies[['PUBMEDID', 'DISEASE/TRAIT']],
    how='left', on='PUBMEDID')

meanfemales_bydisease = AuthorMaster_filtered_merged_bydisease.groupby(
    ['DISEASE/TRAIT'])['isfemale'].mean().to_frame()
numberofstudies_bydisease = Cat_Studies.groupby(
    ['DISEASE/TRAIT'])['DISEASE/TRAIT'].count().to_frame()
mergedgender_andcount_bydisease = pd.merge(
    numberofstudies_bydisease, meanfemales_bydisease, left_index=True,
    right_index=True)
mergedgender_andcount_bydisease = mergedgender_andcount_bydisease.sort_values(
    by='DISEASE/TRAIT', ascending=False)[0:10]
holdstring = 'Percent of authorships across 10 most commonly studied diseases:\n'
for index, row in mergedgender_andcount_bydisease.iterrows():
    # round after multiplying to avoid artifacts like 28.999999999999996
    holdstring = holdstring + index.title() + ' (' + \
        str(round(round(row['isfemale'], 2) * 100, 1)) + '%)\n'
print('\n' + holdstring[:-1])
45.55% male authorship contributions in total
37.81% male unique authors in total
34.86% male first authors in total
49.19% male last authors in total
26.12% female authorship contributions in total
25.16% female unique authors in total
27.17% female first authors in total
19.65% female last authors in total
21.12% unknown authorship contributions in total
28.9% unknown unique authors in total
27.91% unknown first authors in total
23.89% unknown last authors in total
1.38% mostly_female authorship contributions in total
1.48% mostly_female unique authors in total
1.36% mostly_female first authors in total
1.81% mostly_female last authors in total
4.48% andy authorship contributions in total
5.13% andy unique authors in total
7.32% andy first authors in total
3.76% andy last authors in total
1.34% mostly_male authorship contributions in total
1.52% mostly_male unique authors in total
1.39% mostly_male first authors in total
1.7% mostly_male last authors in total

Percent of male author contributions: 63.03%
Percent of unique male authors: 59.62%
Percent of male first authors: 55.96%
Percent of male last authors: 70.34%

Percent of authorships across 10 most commonly studied diseases:
Type 2 Diabetes (33.0%)
Body Mass Index (39.0%)
Colorectal Cancer (34.0%)
Breast Cancer (50.0%)
Schizophrenia (30.0%)
Prostate Cancer (41.0%)
Alzheimer'S Disease (40.0%)
Height (37.0%)
Asthma (37.0%)
Hdl Cholesterol (36.0%)
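The first/last-author extraction used above relies on `drop_duplicates` over the paper ID and assumes rows are stored in byline order; a toy illustration:

```python
import pandas as pd

authors = pd.DataFrame({
    'PUBMEDID':   [1, 1, 1, 2, 2],
    'AUTHORNAME': ['A', 'B', 'C', 'D', 'E'],
})  # rows assumed to be in byline order within each paper

first = authors.drop_duplicates(subset='PUBMEDID', keep='first')
last = authors.drop_duplicates(subset='PUBMEDID', keep='last')
```

`first` keeps authors A and D; `last` keeps C and E.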
In [44]:
Cat_Studies, Cat_Ancestry, Cat_Ancestry_groupedbyN, Cat_Full = load_gwas_cat()
EFO_Parent_Mapped, EFO_Parent_Mapped_NoCommas = EFO_parent_mapper(
    Cat_Studies, Cat_Ancestry_groupedbyN)

ranks, countstudiesperEFO, meanfemale,\
AuthorMaster_EFO, AuthorMaster_Parent = make_meanfemale_andranks(AuthorMaster,
                                                                 EFO_Parent_Mapped)
output_path = os.path.abspath(os.path.join('__file__', "../..", "figures",
                                           'svg', 'Sup_Figure_1_Commas.svg'))
boxswarm_plot(meanfemale, ranks, output_path)

AuthorMaster_Parent.reset_index(inplace=True)
holdstring = 'Get the numbers for Parent Terms:\n'
for index, row in AuthorMaster_Parent.iterrows():
    # round after multiplying to avoid artifacts like 28.999999999999996
    holdstring = holdstring + row['Parent term'].title() + ' (' \
        + str(round(round(row['isfemale'], 2) * 100, 1)) + '%)\n'
print('\n' + holdstring[:-1])
holdstring = 'Get the numbers for Trait Terms (figure above):\n'
for index, row in AuthorMaster_EFO[
    AuthorMaster_EFO['EFO term'].isin(countstudiesperEFO)].sort_values(
        by='isfemale', ascending=False).iterrows():
    holdstring = holdstring + row['EFO term'].title() + ' (' + \
        str(round(round(row['isfemale'], 2) * 100, 1)) + '%)\n'
print('\n' + holdstring[:-1])
Get the numbers for Parent Terms:
Biological Process (41.0%)
Body Meas. (41.0%)
Cancer (47.0%)
Cardiovascular Disease (33.0%)
Cardiovascular Meas. (33.0%)
Digestive System Disorder (31.0%)
Hematological Meas. (34.0%)
Immune System Disorder (36.0%)
Inflammatory Meas. (33.0%)
Lipid/Lipoprotein Meas. (36.0%)
Liver Enzyme Meas. (35.0%)
Metabolic Disorder (34.0%)
Neurological Disorder (35.0%)
Other Disease (34.0%)
Other Meas. (38.0%)
Other Trait (36.0%)
Response To Drug (34.0%)

Get the numbers for Trait Terms (figure above):
Breast Carcinoma (51.0%)
Prostate Carcinoma (42.0%)
Body Mass Index (41.0%)
Bipolar Disorder (37.0%)
Alzheimers Disease (37.0%)
Colorectal Cancer (37.0%)
Asthma (36.0%)
Triglyceride Meas. (36.0%)
Hd Lc Measurement (35.0%)
Rheumatoid Arthritis (35.0%)
Unipolar Depression (35.0%)
Ld Lc Measurement (35.0%)
Type Ii Diabetes Mellitus (33.0%)
Schizophrenia (31.0%)

Let's do exactly the same as above, but dropping the fields that contain commas:

In [45]:
ranks, countstudiesperEFO, meanfemale,\
AuthorMaster_EFO, AuthorMaster_Parent = make_meanfemale_andranks(
    AuthorMaster, EFO_Parent_Mapped_NoCommas)
output_path = os.path.abspath(os.path.join('__file__', "../..", "figures",
                                           'svg', 'Sup_Figure_1_NoCommas.svg'))
boxswarm_plot(meanfemale, ranks, output_path)

AuthorMaster_Parent.reset_index(inplace=True)
holdstring = 'Get the numbers for Parent Terms:\n'
for index, row in AuthorMaster_Parent.iterrows():
    # round after multiplying to avoid artifacts like 28.999999999999996
    holdstring = holdstring + row['Parent term'].title() + ' (' \
        + str(round(round(row['isfemale'], 2) * 100, 1)) + '%)\n'
print('\n' + holdstring[:-1])
holdstring = 'Get the numbers for Trait Terms (figure above):\n'
for index, row in AuthorMaster_EFO[
    AuthorMaster_EFO['EFO term'].isin(countstudiesperEFO)].sort_values(
        by='isfemale', ascending=False).iterrows():
    holdstring = holdstring + row['EFO term'].title() + ' (' + \
        str(round(round(row['isfemale'], 2) * 100, 1)) + '%)\n'
print('\n' + holdstring[:-1])
Get the numbers for Parent Terms:
Biological Process (38.0%)
Body Meas. (40.0%)
Cancer (48.0%)
Cardiovascular Disease (31.0%)
Cardiovascular Meas. (32.0%)
Digestive System Disorder (32.0%)
Hematological Meas. (34.0%)
Immune System Disorder (36.0%)
Inflammatory Meas. (29.0%)
Lipid/Lipoprotein Meas. (36.0%)
Liver Enzyme Meas. (36.0%)
Metabolic Disorder (34.0%)
Neurological Disorder (35.0%)
Other Disease (34.0%)
Other Meas. (36.0%)
Other Trait (36.0%)
Response To Drug (33.0%)

Get the numbers for Trait Terms (figure above):
Breast Carcinoma (50.0%)
Prostate Carcinoma (42.0%)
Body Mass Index (40.0%)
Alzheimers Disease (37.0%)
Triglyceride Meas. (36.0%)
Asthma (36.0%)
Bone Density (36.0%)
Hd Lc Measurement (35.0%)
Unipolar Depression (35.0%)
Ld Lc Measurement (35.0%)
Colorectal Cancer (34.0%)
Bipolar Disorder (34.0%)
Type Ii Diabetes Mellitus (33.0%)
Schizophrenia (30.0%)

Compare Author Metrics Between Genders

In [46]:
allauthors_merged_reset = allauthors_merged.reset_index().rename(columns={'index': 'Name'})
allauthors_merged_reset['CleanForename'] = allauthors_merged_reset['Name'].map(lambda x: x.split(' ')[0]) 
allauthors_merged_reset['CleanGender'] = allauthors_merged_reset['CleanForename'].map(
    lambda x: gendet.get_gender(x))
allauthors_merged_reset['MaleFemale'] = allauthors_merged_reset['CleanGender'].str.replace('mostly_',
                                                                     '')
allauthors_merged_reset['isfemale'] = np.where(
   allauthors_merged_reset['MaleFemale'] == 'female', 1, 0)

print('The average GWAS-H Index for females is: ' +
      str(round(allauthors_merged_reset[allauthors_merged_reset['MaleFemale']=='female']\
                ['GWAS-H'].mean(), 2)))
print('The average GWAS-H Index for males is: ' +
      str(round(allauthors_merged_reset[allauthors_merged_reset['MaleFemale']=='male']\
                ['GWAS-H'].mean(), 2)))
print('The average number of papers published by females is: ' +
      str(round(allauthors_merged_reset[allauthors_merged_reset['MaleFemale']=='female']\
                ['Papers'].mean(), 2)))
print('The average number of papers published by males is: ' +
     str(round(allauthors_merged_reset[allauthors_merged_reset['MaleFemale']=='male']\
               ['Papers'].mean(), 2)))
print('The average number of citations for females is: ' +
      str(round(allauthors_merged_reset[allauthors_merged_reset['MaleFemale']=='female']\
                ['citedByCount'].mean(), 2)))
print('The average number of citations for males is: ' +
      str(round(allauthors_merged_reset[allauthors_merged_reset['MaleFemale']=='male']\
                ['citedByCount'].mean(), 2)))
The average GWAS-H Index for females is: 4.85
The average GWAS-H Index for males is: 5.34
The average number of papers published by females is: 6.15
The average number of papers published by males is: 7.17
The average number of citations for females is: 648.44
The average number of citations for males is: 780.73
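The six prints above can be condensed into a single `groupby`, which computes the same per-gender means while avoiding the repeated boolean filters; a sketch on made-up numbers:

```python
import pandas as pd

df = pd.DataFrame({
    'MaleFemale': ['male', 'female', 'male', 'female'],
    'GWAS-H':     [6, 5, 4, 4],
    'Papers':     [8, 6, 6, 6],
})

# mean of each metric per inferred gender, rounded as in the prints above
means = df.groupby('MaleFemale')[['GWAS-H', 'Papers']].mean().round(2)
```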

Cohorts and Consortia

Consortia/collectives

Let's use the PubMed database returns on collectives to analyze which consortia and study groups are acknowledged:

In [47]:
pubmed_collectives = pd.read_csv(os.path.abspath(
                                 os.path.join('__file__', '../..',
                                              'data', 'PUBMED',
                                              'Pubmed_CollectiveInfo.csv')),
                                 encoding='latin-1')
collect_dict = pd.read_csv(os.path.abspath(
                           os.path.join(
                               '__file__', '../..', 'data',
                               'Support', 'Collectives',
                               'Collective_Dictionary.csv')),
                           encoding='latin-1')
pubmed_collectives['COLLECTIVE'] = pubmed_collectives['COLLECTIVE'].str.strip()
pubmed_collectives['COLLECTIVE'] = pubmed_collectives['COLLECTIVE'].str.lower()
collect_dict['COLLECTIVE'] = collect_dict['COLLECTIVE'].str.strip(
).str.lower()
merged_collect = pd.merge(
    pubmed_collectives, collect_dict, how='left', on='COLLECTIVE')
if len(merged_collect[merged_collect['Clean_Name'].isnull()]) > 0:
    print('Danger! Correct collectives in Collective_Unverified.csv ' +
          'and add to the dictionary\n')
    unverified_collect = merged_collect[merged_collect['Clean_Name'].isnull(
    )]
    unverified_collect.to_csv(os.path.abspath(
                              os.path.join('__file__', '../..',
                                           'data', 'Support', 'Collectives',
                                           'Collective_Unverified.csv')),
                              encoding='latin-1')
else:
    print('The consortium dictionary is up to date!\n')
consortiumlist = []
for index, row in merged_collect.iterrows():
    if pd.notnull(row['Clean_Name']):
        for consortium in row['Clean_Name'].split(';'):
            consortiumlist.append(consortium)
clean_collectives = pd.DataFrame(consortiumlist, columns=['Consortium'])
groupby_collectives = clean_collectives.groupby(
    ['Consortium'])['Consortium'].count().sort_values(ascending=False)
holdstring = 'The most frequently seen consortia are:\n'
for index, row in groupby_collectives.to_frame()[0:10].iterrows():
    holdstring = holdstring + index + ' (' + str(row['Consortium']) + ')\n'
print(holdstring[:-1])

print('\nWe have seen a total of ' +
      str(len(groupby_collectives.to_frame())) + ' different collectives!')
print('A total of ' + str(len(pubmed_collectives['PUBMEDID'].drop_duplicates(
))) + ' papers are contributed to by at least one collective.')
print('A total of ' + str(groupby_collectives.sum()) +
      ' collective contributions are made.')
The consortium dictionary is up to date!

The most frequently seen consortia are:
Wellcome Trust Case Control Consortium (49)
CHARGE (46)
Wellcome Trust Case Control Consortium 2 (36)
LifeLines Cohort Study (30)
DIAGRAM (29)
Alzheimer's Disease Neuroimaging Initiative (25)
CARDIOGRAM (24)
MAGIC (20)
GIANT Consortium (18)
23andMe (17)

We have seen a total of 681 different collectives!
A total of 844 papers are contributed to by at least one collective.
A total of 1654 collective contributions are made.
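Splitting the semicolon-delimited `Clean_Name` field and tallying consortia can also be written with `str.split` plus `explode` (pandas ≥ 0.25) instead of the explicit loop; a sketch on invented rows:

```python
import pandas as pd

merged = pd.DataFrame({'Clean_Name': ['CHARGE;MAGIC', 'CHARGE', None]})

counts = (merged['Clean_Name']
          .dropna()            # papers with no recognised collective
          .str.split(';')
          .explode()           # one row per (paper, consortium) pair
          .value_counts())
```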

Datasets

Let's now wrangle together some of the manually collated data and consider the most frequently observed cohorts, with the aim of analyzing the resampling issue discussed earlier in the literature.

In [48]:
finaldataset = pd.read_csv(os.path.abspath(
    os.path.join('__file__', '../..', 'data',
                 'Support', 'Cohorts',
                 'manually_parsed_data.csv')),
    encoding='latin-1')
mylist = []
for index, row in finaldataset.iterrows():
    mylist = mylist + row['DATASETS'].split(';')
mylist = [x.strip().upper() for x in mylist]
df1 = pd.DataFrame({'Cohorts': mylist})
reader = csv.reader(open(os.path.abspath(os.path.join(
    '__file__', '../..', 'data', 'Support', 'Cohorts',
    'Dictionary_cohorts.csv')), 'r'))
d = {}
for row in reader:
    k, v = row
    d[k] = v
for key in d:
    df1['Cohorts'].replace(key, d[key], inplace=True)
df1['Cohorts'] = df1['Cohorts'].str.replace(
    'Rotterdam Study I (RS-I)', 'Rotterdam Study (RS)', regex=False)
df1['Cohorts'] = df1['Cohorts'].str.replace('&', 'and', regex=False)
df1['Cohorts'] = df1['Cohorts'].str.upper()
newdf = pd.DataFrame({'Count': df1.groupby(['Cohorts']).size()}).reset_index()
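Since `Series.replace` also accepts a dictionary, the per-key loop above can be collapsed into a single vectorized call; a sketch with made-up cohort spellings:

```python
import pandas as pd

cohorts = pd.Series(['FHS', 'FRAMINGHAM HEART STUDY', 'ARIC'])
canonical = {'FRAMINGHAM HEART STUDY': 'FHS'}  # toy dictionary

cleaned = cohorts.replace(canonical)  # one pass over the whole dict
counts = cleaned.value_counts()
```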

Then merge this with manually collated data:

In [49]:
manual_cohort_for_merge = pd.read_csv(os.path.abspath(
    os.path.join('__file__', "../..", "data",
                 "Support", "Cohorts",
                 "manual_cohort_for_merge.csv")),
    encoding='latin-1', index_col=False)
merged_manual = pd.merge(newdf, manual_cohort_for_merge,
                         how='left', on='Cohorts')
merged_manual.set_index('Cohorts').sort_values(
    by='Count', ascending=False).drop('NO NAME').head(10).style
merged_manual.set_index('Cohorts').sort_values(
    by='Count', ascending=False).drop('NO NAME').to_csv(os.path.abspath(
                                                        os.path.join('__file__',
                                                                     '../..',
                                                                     'tables',
                                                                     'Manually_Curated_Cohorts.csv')))

Let's now print out a list of the 30 most commonly used datasets, in case we need to reference any particularly prominent ones (e.g. deCODE, UK Biobank, etc.):

In [50]:
merged_manual = merged_manual.set_index('Cohorts').sort_values(by='Count',
                                                               ascending=False)
holder = 'The 30 most commonly used datasets are:\n\n'
for index, row in merged_manual[0:30].iterrows():
    holder = holder + \
        str(index) + ' (' + str(row['Count']) + ' times)\n'
print(holder[:-1])
merged_manual[['Count']].to_csv(os.path.abspath(
                                os.path.join('__file__',
                                             '../..',
                                             'data',
                                             'Support',
                                             'Cohorts',
                                             'Cohort_Count.csv')), encoding='latin-1')
The 30 most commonly used datasets are:

ROTTERDAM STUDY (RS) (398 times)
NO NAME (297 times)
COOPERATIVE HEALTH RESEARCH IN THE REGION OF AUGSBURG (KORA) (255 times)
FRAMINGHAM HEART STUDY (FHS) (207 times)
ATHEROSCLEROSIS RISK IN COMMUNITIES STUDY (ARIC) (204 times)
CARDIOVASCULAR HEALTH STUDY (CHS) (179 times)
BRITISH 1958 BIRTH COHORT STUDY (B58C) (156 times)
UK ADULT TWIN REGISTER (TWINSUK) (140 times)
EUROPEAN PROSPECTIVE INVESTIGATION INTO CANCER (EPIC) (132 times)
NURSES HEALTH STUDY (NHS) (129 times)
STUDY OF HEALTH IN POMERANIA (SHIP) (127 times)
AGING GENE-ENVIRONMENT SUSCEPTIBILITY - REYKJAVIK STUDY (AGES) (119 times)
DECODE GENETICS (DECODE) (113 times)
ERASMUS RUCPHEN FAMILY STUDY (ERF) (104 times)
WELLCOME TRUST CASE CONTROL CONSORTIUM (WTCCC) (103 times)
MULTI-ETHNIC STUDY OF ATHEROSCLEROSIS (MESA) (98 times)
HEALTH, AGING, AND BODY COMPOSITION STUDY (HEALTH ABC) (92 times)
NORTHERN FINNISH BIRTH COHORT STUDY (NFBC) (87 times)
NETHERLANDS TWIN REGISTRY (NTR) (78 times)
ORKNEY COMPLEX DISEASE STUDY (ORCADES) (78 times)
LOTHIAN BIRTH COHORT (LBC) (78 times)
ESTONIAN GENOME CENTER OF THE UNIVERSITY OF TARTU (EGCUT) (75 times)
AVON LONGITUDINAL STUDY OF PARENTS AND CHILDREN (ALSPAC) (75 times)
COHORTE LAUSANNOISE (COLAUS) (75 times)
THE INVECCHIARE IN CHIANTI (INCHIANTI) (70 times)
WOMENS GENOME HEALTH STUDY (WGHS) (69 times)
HEALTH PROFESSIONAL FOLLOW-UP STUDY (HPFS) (69 times)
SARDINIA STUDY OF AGING (SARDINIA) (68 times)
BIOBANK JAPAN PROJECT (BBJ) (68 times)
CARDIOVASCULAR RISK IN YOUNG FINNS STUDY (YFS) (67 times)
In [51]:
Cat_Studies, Cat_Ancestry, Cat_Ancestry_groupedbyN, Cat_Full = load_gwas_cat()
PUBMED_by_N = pd.DataFrame(Cat_Ancestry.groupby(['PUBMEDID'])['N'].agg('sum').sort_values(ascending=False))
pubmed_manual = pd.merge(PUBMED_by_N.reset_index(), finaldataset, left_on='PUBMEDID',
                                 right_on='PUBMEDID', how='left')
print('We have counted ' + str(len(df1['Cohorts'])) + ' (non-unique) cohorts in total')
print('We have counted ' + str(len(newdf)) + ' unique cohorts in total.')
print("We're currently missing " +
      str(len(set(PUBMED_by_N[0:1000].index)-set(finaldataset['PUBMEDID']))) + 
     " of the biggest 1000 papers grouped by summed ancestry!")
print('We currently have coverage for ' + 
      str(round((pubmed_manual[pubmed_manual['In Charge'].notnull()]['N'].sum()/
          PUBMED_by_N['N'].sum())*100,2)) +
      '% of all papers by N.')
print('We currently have coverage for ' + 
      str(round((len(finaldataset)/len(PUBMED_by_N))*100,2)) +
      '% of all papers in absolute terms.')
pubmed_manual.to_csv(os.path.abspath(
                             os.path.join('__file__',
                                          '../..',
                                          'data',
                                          'Support',
                                          'Cohorts',
                                          'Merged_Cohorts_Manual_Curation.csv')),
                     encoding='latin-1')
not_curated = pubmed_manual[~pubmed_manual['In Charge'].notnull()]['PUBMEDID']
not_curated.to_csv(os.path.abspath(os.path.join('__file__', '../..', 'data',
                                                'Support', 'Cohorts',
                                                'Outstanding_PUBMEDIDs_To_Curate.csv')),
                   encoding='latin-1')
We have counted 11851 (non-unique) cohorts in total
We have counted 2000 unique cohorts in total.
We're currently missing 67 of the biggest 1000 papers grouped by summed ancestry!
We currently have coverage for 83.14% of all papers by N.
We currently have coverage for 34.06% of all papers in absolute terms.
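The missing-papers count above is a set difference between the largest papers' PUBMEDIDs and those already curated; a toy sketch with hypothetical IDs:

```python
biggest = {101, 102, 103, 104}  # e.g. top papers by summed ancestry N
curated = {101, 103}            # PUBMEDIDs already hand-curated

missing = biggest - curated
coverage = round(len(biggest & curated) / len(biggest) * 100, 2)
```

`missing` is `{102, 104}` and `coverage` is 50.0.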